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THE SPECTRUM OF KERNEL RANDOM MATRICES 
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University of California, Berkeley 

We place ourselves in the setting of high-dimensional statistical 
inference where the number of variables p in a dataset of interest is 
of the same order of magnitude as the number of observations n. 

We consider the spectrum of certain kernel random matrices, in 
particular nxn matrices whose (i, j)th entry is f(X[Xj/p) or — 
Xj || 2 /p) where p is the dimension of the data, and Xi are independent 
data vectors. Here / is assumed to be a locally smooth function. 

The study is motivated by questions arising in statistics and com- 
puter science where these matrices are used to perform, among other 
things, nonlinear versions of principal component analysis. Surpris- 
ingly, we show that in high-dimensions, and for the models we ana- 
lyze, the problem becomes essentially linear — which is at odds with 
heuristics sometimes used to justify the usage of these methods. The 
analysis also highlights certain peculiarities of models widely stud- 
ied in random matrix theory and raises some questions about their 
relevance as tools to model high-dimensional data encountered in 
practice. 

1. Introduction. Recent years has seen newfound theoretical interest in 
the properties of large- dimensional sample covariance matrices. With the 
increase in the size and dimensionality of datasets to be analyzed, questions 
have been raised about the practical relevance of information derived from 
classical asymptotic results concerning spectral properties of sample covari- 
ance matrices. To address these concerns, one line of analysis has been the 
consideration of asymptotics where both the sample size n and the num- 
ber of variables p in the dataset go to infinity jointly while assuming, for 
instance, that p/n had a limit. 
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This type of questions concerning the spectral properties of large-dimen- 
sional matrices have been and are being addressed in variety of fields, from 
physics to various areas of mathematics. While the topic is classical, with the 
seminal contribution [43] dating back from the 1950s, there has been renewed 
and vigorous interest in the study of large-dimensional random matrices in 
the last decade or so. This has led to new insights and the appearance of 
new "canonical" distributions [38], new tools (see [41]) and, in Statistics, a 
sense that one needs to exert caution with familiar techniques of multivariate 
analysis when the dimension of the data gets large and the sample size is of 
the same order of magnitude as that dimension. 

So far in Statistics, this line of work has been concerned mostly with 
the properties of sample covariance matrices. In a seminal paper, Marcenko 
and Pastur [30] showed a result that, from a statistical standpoint, may be 
interpreted as saying, roughly, that asymptotically, the histogram the eigen- 
values of a sample (i.e., random) covariance matrix is (asymptotically) a 
deterministic nonlinear deformation of the histogram of the eigenvalues of 
the population covariance matrix. Remarkably, they managed to character- 
ize this deformation for fairly general population covariances. Their result 
was shown in great generality and introduced new tools to the field including 
one that has become ubiquitous, the Stieltjes transform of a distribution. In 
its best-known form, their result says that when the population covariance 
is identity, and hence all the population eigenvalues are equal to 1, in the 
limit the sample eigenvalues are split, and if p < n, they are spread between 
[(1 — ^/p/n) 2 , (1 + \/p/n) 2 ] according to a fully explicit density known now as 
the density of the Marcenko-Pastur law. Their result was later re-discovered 
independently in [42] (under slightly weaker conditions) and generalized to 
the case of nondiagonal covariance matrices in [37] under some particular 
distributional assumptions which we discuss later in the paper. 

On the other hand, recent developments have been concerned with fine 
properties of the largest eigenvalue of random matrices which became amenable 
to analysis after mathematical breakthroughs which happened in the 1990s 
(see [38, 39] and [40]). Classical statistical work on joint distribution of 
eigenvalues of sample covariance matrices (see [1] for a good reference) then 
became usable for analysis in high-dimensions. In particular, in the case 
of gaussian distributions, with Id covariance, it was shown in [27] and [16] 
that the largest eigenvalue of the sample covariance matrix is Tracy- Widom 
distributed. More recent progress [17] managed to carry out the analysis 
for essentially general population covariance. On the other hand, models for 
which the population covariance has a few separated eigenvalues have also 
been of interest (see, for instance, [8] and [31]). Beside the particulars of the 
different type of fluctuations that can be encountered (Tracy- Widom, Gaus- 
sian or other), researchers have been able to precisely localize these largest 
eigenvalues. One interesting aspect of those results is the fact that in the 
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high-dimensional setting of interest to us, the largest eigenvalues are always 
positively biased, with the bias being sometime large. (We also note that 
in the case of i.i.d. data — which naturally is less interesting in statistics — 
results on the localization of the largest eigenvalue have been available for 
quite some time now, after the works [21] and [45], to cite a few.) This is 
naturally in sharp contrast to classical results of multivariate analysis which 
show -y/n-consistency of all sample eigenvalues; though the possibility of bias 
is a simple consequence of Jensen's inequality. 

On the other hand, there has been much less theoretical work on kernel 
random matrices. By this term, we mean matrices with entry of the 
form, 

M i>j = k{X it X j ), 

where M is an n x n matrix, Xi is a p-dimensional data vector and A; is a 
function of two variables, often called a kernel, that may depend on n. Com- 
mon choices of kernels include, for instance, k(Xi,Xj) = f(\\Xi — Xj\\ 2 /t), 
where / is a function and t is a scalar, or k(Xi,Xj) = f(X[Xj/t). For the 
function /, common choices include f{x) = exp(— x), f{x) = exp(— x a ), for a 
certain scalar a, f(x) = (1 + x) a , or f(x) = tanh(a + bx), where b is a scalar. 
We refer the reader to [33], Chapter 4, or [44] for more examples. 

In particular, we are not aware of any work in the setting of high-dimensional 
data analysis, where p grows with n. However, given the practical success 
and flexibility of these methods (we refer to [35] for an introduction), it is 
natural to try to investigate theoretically their properties. Further, as illus- 
trated in the data analytic part of [44], ann/p boundedness assumption is 
not unrealistic as far as applications of kernel methods are concerned. One 
aim of the present paper is to shed some theoretical light on the properties 
of these kernel random matrices and to do so in relatively wide generality. 
We note that the choice of renormalization that we make below is motivated 
in part by the arguments of [44] and their practical choices of kernels for 
data of varying dimensions. 

Existing theory on kernel random matrices (see, for instance, the inter- 
esting [28]) for fixed-dimensional input data predicts that the eigenvalues 
of kernel random matrices behave — at least for the largest ones — like the 
eigenvalues of the corresponding operator on L 2 (dP), if the data is i.i.d. 
with probability distribution P. To be more precise, if Xi is a sequence of 
i.i.d. random variables with distribution P, under regularity conditions on 
the kernel k(x,y), it was shown in [28] that, for any index I, the /th largest 
eigenvalue of the kernel matrix M with entries, 



M lJ = -k(X l ,X j ), 
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converges to the Zth largest eigenvalue of the operator K defined as 



These insights have also been derived through more heuristic but nonetheless 
enlightening arguments in, for instance, [44]. Further, more precise fluctua- 
tion results are also given in [28]. We also note interesting work on Laplacian 
eigenmaps (see, e.g., [9]) where, among other things, results have been ob- 
tained showing convergence of eigenvalues and eigenvectors of certain Lapla- 
cian random matrices (which are quite closely connected to kernel random 
matrices) computed from data sampled from a manifold, to corresponding 
quantities for the Laplace-Beltrami operator on the manifold. 

These results are in turn used in the literature to explain the behavior 
of nonlinear versions of standard procedures of multivariate statistics, such 
as Principal Component Analysis (PCA), Canonical Correlation Analysis 
(CCA) or Independent Component Analysis (CCA). We refer the reader 
to [36] for an introduction to kernel-PC A, and to [2] for an introduction to 
kernel-CCA and kernel-ICA. At the heart of these techniques are the spectral 
properties of kernel random matrices. Because these techniques are used in 
bioinformatics, a field where large datasets are common and becoming the 
norm, it is natural to ask what can be said about these spectral properties 
for high-dimensional data. 

We show that for the models we analyze (ICA-type models and general- 
izations that go beyond the linear setting of ICA), kernel random matrices 
essentially behave like sample covariance matrices and hence their eigen- 
values suffer from the same bias problems that affect sample covariance 
matrices in high-dimensions. In particular, if one were to try to apply the 
heuristics of [44] which were developed for low- dimensional problems, to the 
high-dimensional case, the predictions would be quite wildly wrong. (A sim- 
ple example is provided by the Gaussian kernel with i.i.d. Gaussian data 
where the computations can be done completely explicitly as explained in 
[44].) We also note that the scaling we use is different from the one used 
in low dimensions, where the matrices are scaled by 1/n. This is because 
the high-dimensional problem would be completely degenerate if we used 
this normalization in our setting. However, our results still give information 
about the problem when it is scaled by 1/n. 

From a random matrix point of view, our study is connected to the study 
of Euclidean random matrices and distance matrices which is of some interest 
in, for instance, Physics. We refer to [11] and [12] for work in this direction 
in the low (or fixed) dimensional setting. We also note that at the level of 
generality we place ourselves in, the random matrices we study do not seem 
to be amenable to study through the classical tools of random matrix theory. 
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Hence, beside their obvious statistical interest, they are also interesting on 
purely mathematical grounds. 

We now turn to the gist of our paper, which will show that high-dimensional 
kernel random matrices behave spectrally essentially like matrices closely 
connected to sample covariance matrices. We will get two types of results: 
in Theorems 2.1 and 2.2, we get a strong approximation result (in operator 
norm) for standard models (ICA-like) studied in random matrix theory. In 
Theorems 2.3 and 2.4, we characterize the limiting spectral distribution of 
our kernel random matrices, for a wider class of data distributions. In Sec- 
tion 2, we also state clearly the consequences of our theorems and review 
the relevant theory of high-dimensional sample covariance matrices. From a 
technical standpoint, we adopt a point of view centered on the concentra- 
tion of measure phenomenon, as exposed for instance in [29] as it provides 
a unified way to treat the two types of results we are interested in. Finally, 
we discuss in our (self-contained) conclusion (Section 3), the consequences 
of our results and in particular some possible limitations of "standard" ran- 
dom matrix models as tools to model data encountered in practice focusing 
on geometric properties of datasets drawn according to those models. As 
explained in more details there, vectors drawn according to these standard 
random matrix models essentially live close to spheres and are almost or- 
thogonal to one another, a property that may or may not be present in 
datasets to be analyzed and can be seen as a key to many classical and less 
classical random matrix results (see also [19]). 

2. Spectrum of kernel random matrices. Kernel random matrices do not 
seem to be amenable to analysis through the usual tools of random matrix 
theory. In particular, for general /, it seems difficult to carry out either a 
method of moments proof, or a Stieltjes transform proof, or a proof that 
relies on knowing the density of the eigenvalues of the matrix. 

Hence, we take an indirect approach. Our strategy is to find approxi- 
mations of the kernel random matrix that have two properties. First, the 
approximation matrix is analyzable or has already been analyzed in random 
matrix theory. Second, the quality of the approximation is good enough that 
spectral properties of the approximating matrix can be shown to carry over 
to the kernel matrix. 

The strategy in the first two theorems is to derive an operator norm "con- 
sistent" approximation of our kernel matrix. In other words, if we call M 
our kernel matrix, we will find K such that \\\M — K\\\2 — > 0, as n and p 
tend to oo. Note that both M and K are real symmetric (and hence Her- 
mitian) here. We explain after the statement of Theorem 2.1 why operator 
norm consistency is a desirable property. But let us say that in a nutshell, 
it implies consistency for each individual eigenvalue as well as eigenspaces 
corresponding to separated eigenvalues. 
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For the second set of theorems (Theorems 2.3 and 2.4), we will relax 
the distributional assumptions made on the data, but, at the expense of 
the precision of the results we will obtain, we will characterize the limiting 
spectral distribution of our kernel random matrices. 

Our theorems below show that kernel random matrices can be well ap- 
proximated by matrices that are closely connected to large-dimensional co- 
variance matrices. The spectral properties of those matrices have been the 
subject of a significant amount of work in recent and less recent years, and 
hence this knowledge, or at least part of it, can be transferred to kernel 
random matrices. In particular, we refer the reader to [4, 5, 8, 17, 19, 21, 
27, 30, 31, 37, 42] and [45] for some of the most statistically relevant results 
in this area. We review some of them now. 

2.1. Some results on large- dimensional sample covariance matrices. Since 
our main theorems are approximating theorems, we first wish to state some 
of the properties of the objects we will use to approximate kernel random 
matrices. In what follows, we consider annxp data matrix, with, say p/n 
having a finite nonzero limit. Most of the results that have been obtained are 
of two types: either they are so-called "bulk" results and concern essentially 
the spectral distribution (or loosely speaking the histogram of eigenvalues) 
of the random matrices of interest; or they concern the localization and 
fluctuation behavior of extreme eigenvalues of these random matrices. 

2.1.1. Spectral distribution results. An object of interest in random ma- 
trix theory is the spectral distribution of random matrices. Let us call l{ the 
decreasingly ordered eigenvalues of our random matrix, and let us assume we 
are working with an n x n matrix, M n . The empirical spectral distribution 
of a n x n matrix is the probability measure which puts mass 1/n at each 
of its eigenvalues. In other words, if we call F n this probability measure, we 
have 

n 

dF n (x) = -Y t Si i (x). 

Note that the histogram of eigenvalues represents an integrated version of 
this measure. 

For random matrices, this measure F n is naturally a random measure. A 
key result in the area of covariance matrices is that if we observe i.i.d. data 
vectors Xj, with Xi = T,^ 2 Yi, where S p is a positive semi-definite matrix 
and Yi is a vector with i.i.d entries, under weak moment conditions on Y{ 
and assuming that the spectral distribution of T, p has a limit (in the sense of 
weak convergence of distributions) , F n converges to a nonrandom measure 
which we call F. 



SPECTRUM OF KERNEL RANDOM MATRICES 



7 



We call the models Xi = E p Yi the "standard" models of random matrix 
theory because most results have been derived under these assumptions. In 
particular, various results [5, 6, 21] show, among many other things, that 
when the entries of the vector Y have 4 (absolute) moments, the largest 
eigenvalues of the sample covariance matrix X'X/n, where Xi now occupies 
the ith row of the nx p matrix X, stay close to the endpoint of the support 
of F. 

A natural question is therefore to try to characterize F. Except in par- 
ticular situations, it is difficult to do so explicitly. However, it is possible to 
characterize a certain transformation of F. The tool of choice in this context 
is the Stieltjes transform of a distribution. It is a function defined on C + by 
the formula, if we call Stj? the Stieltjes transform of F, 

St F (z)= [ ^f^-, Im[z] >0. 
J A - z 

In particular for empirical spectral distributions, we see that, if F n is the 
spectral distribution of the matrix M n , 

1 n 1 1 

St Fn (z) = - V = - trace((M n - zld)" 1 ). 

n Li — z n 

i=l 

The importance of the Stieltjes transform in the context of random matrix 
theory stems from two facts: on the one hand, it is connected fairly explic- 
itly to the matrices that are being analyzed; on the other hand, pointwise 
convergence of Stieltjes transform implies weak convergence of distributions, 
if a certain mass preservation condition is satisfied. This is how a number 
of bulk results are therefore proved. For a clear and self-contained introduc- 
tion to the connection between Stieltjes transforms and weak convergence 
of probability measures, we refer the reader to [22]. 

The result of [30], later generalized by [37] for standard random matrix 
models with nondiagonal covariance, and more recently by [19], away from 
those standard models, is a functional characterization of the limit F. If 
one calls w n (z) the Stieltjes transform of the empirical spectral distribution 
of XX' jn, w n (z) converges pointwise (and almost surely after [37]) to a 
nonrandom w(z) which, as a function, is a Stieltjes transform. Moreover, w, 
the Stieltjes transform of F, satisfies the equation, if p/n — > p, p > 0, 

1 _ r XdH(X) 

w(z) ^ J 1 + Xw 

where H is the limiting spectral distribution of S p , assuming that such 
a distribution exists. We note that [37] proved the result under a second 
moment condition on the entries of Yi. 
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From this result, [30] derived that in the case where T, p = Id, and hence 
dH = 5±, the empirical spectral distribution has a limit whose density is, if 

f p { X ) = ^^ x){x - a \ 
Jpy ' 2tt P x 

where a = (1 — p 1 ^ 2 ) 2 and b = (1 + p 1 / 2 ) 2 . The difference between the popu- 
lation spectral distribution (a point mass at 1, of mass 1) and the limit of 
the empirical spectral distribution is quite striking. 

2.1.2. Largest eigenvalues results. Another line of work has been focused 
on the behavior of extreme eigenvalues of sample covariance matrices. In par- 
ticular, [21] showed, under some moment conditions, that when E p = Id p , 
li(X'X/n) — > (1 + \fp/n) 2 almost surely. In other words, the largest eigen- 
value stays close to the endpoint of the limiting spectral distribution of 
X'X/n. This result was later generalized in [45], and was shown to be true 
under the assumption of finite fourth moment only, for data with mean 0. 
In recent years, fluctuation results have been obtained for this largest eigen- 
value which is of practical interest in Principal Components Analysis (PCA). 
Under Gaussian assumptions, [16] and [27] (see also [20] and [26]) showed 
that the fluctuations of the largest eigenvalue are Tracy- Widom distributed. 
For the general covariance case, similar results, as well as localization infor- 
mation, were recently obtained in [17]. We note that the localization infor- 
mation (i.e., a formula) that was discovered in this latter paper was shown 
to hold for a wide variety of standard random matrix models through appeal 
to [5]. We refer the interested reader to Fact 2 in [17] for more information. 
Interesting work has also been done on so-called "spiked" models where a 
few population eigenvalues are separated from the bulk of them. In partic- 
ular, in the case where all population eigenvalues are equal, except for one 
that is significantly larger (see [7] for the discovery of an interesting phase 
transition), [31] showed, in the Gaussian case, inconsistency of the largest 
sample eigenvalue, as well as the fact that the angle between the popula- 
tion and sample principal eigenvectors is bounded away from 0. Paul [31] 
also obtained fluctuation information about the largest empirical eigenvalue. 
Finally, we note that the same inconsistency of eigenvalue result was also 
obtained in [8], beyond the Gaussian case. 

2.1.3. Notation. Let us now define some notation and add some clarifi- 
cations. 

We denote by A' the transpose of A. The matrices we will be working 
with all have real entries. We remind the reader that if A and B are two 
rectangular matrices, AB and BA have the same eigenvalues, except for 
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possibly, a certain number of zeros. We will make repeated use of this fact, 
for example, for matrices like X'X and XX' . In the case where A and B 
are both square, AB and BA have exactly the same eigenvalues. 

We will also need various norms on matrices. We will use the so-called 
operator norm, which we denote by ||| -A|||2 which corresponds to the largest 
singular value of A, that is, maxj y/li(A'A). We occasionally denote the 
largest singular value of A by <j\(A). Clearly, for positive semi-definite ma- 
trices, the largest singular value is equal to the largest eigenvalue. Finally, 
we will sometimes need to use the Frobenius (or Hilbert-Schmidt) norm of 
a matrix A. We denote it by ||j4||f- By definition, it is simply, because we 
are working with matrices with real entries, 



Further, we use o to denote the Hadamard (i.e., entrywise) product of two 
matrices. We denote by /j, m the mth moment of a random variable. Note that 
by a slight abuse of notation, we might also use the same notation to refer to 
the mth absolute moment (i.e., £/|X| m ) of a random variable, but if there is 
any ambiguity, we will naturally make precise which definition we are using. 

Finally, in the discussion of standard random matrix models that follows, 
there will be arrays of random variables and a.s. convergence. We work 
with random variables defined on a common probability space. To each 
lo corresponds an infinite-dimensional array of numbers. Unless otherwise 
noted, the n x p matrices we will use in what follows are the "upper-left" 
corner of this array. 

We now turn to the study of kernel random matrices. We will show that 
we can approximate them by matrices that are closely connected to sample 
covariance matrices in high-dimensions and, therefore, that a number of the 
results we just reviewed also apply to them. 

2.2. Inner-product kernel matrices: f{X[Xj/p). 

Theorem 2.1 (Spectrum of inner product kernel random matrices). Let 
us assume that we observe n i.i.d. random vectors, Xi in W. Let us consider 
the kernel matrix M with entries 



We assume that: 

(a) nxp, that is, n/p andp/n remain bounded asp— >oo. 

(b) S p is a positive s emi- definite p x p matrix, and |||Sp|||2 = ci(S p ) remains 
bounded in p, that is, there exists K > 0, such that ai(S p ) < K, for all 
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(c) T, p /p has a finite limit, that is, there exists I 6 R such that limp_ s . 0O trace(X p ) / 
p = I. 

(d) X i = T}J 2 Y i . 

(e) The entries ofYi, a p- dimensional random vector, are i.i.d. Also, denot- 
ing byYi(k) the kth entry ofYi, we assume thatE(Yi(k)) = 0, v&i(Yi(k)) = 
1 and E(|Yj(£;)| 4+£ ) < oo for some e > 0. (We say that Yi has 4 + e ab- 
solute moments.) 

(f) f is a C 1 function in a neighborhood of I = lim p _> 00 trace (E p )/p and a 
C 3 function in a neighborhood o/O. 

Under these assumptions, the kernel matrix M can (in probability) be 
approximated consistently in operator norm, when p and n tend to oo, by 
the matrix K , where 

( „ trace(££)\ , , XX' 

K = (/(0) + f"(0) ^ p> \ 11' + /'(0)— + v p U n , 

where 

v p J p 

In other words, 

|||M — -fir|||2 — > in probability, when p— > oo. 

The advantages of obtaining an operator norm consistent estimator are 
many. We list some here: 

• Asymptotically, M and K have the same j'-largest eigenvalue, for any 
j; this is simply because for symmetric matrices, if lj is the jth largest 
eigenvalue of a matrix, Weyl's inequality (see, e.g., Corollary III. 2. 6 in 
[10]) implies that 

\lj(M) -lj(K)\ < \\\M-K\\\ 2 . 

Hence our result implies that \h(M) — h(K) \ —> in probability as p and 
n go to infinity. 

• The limiting spectral distributions of M and K (if they exist) are the 
same. This is a consequence of Lemma 2.1, page 31 below. So in partic- 
ular, when K has a limiting spectral distribution (in the sense of weak 
convergence of probability measures), the empirical spectral distribution 
of M converges to that distribution (in the sense of weak convergence of 
distributions) in probability. 

• We have subspace consistency for eigenspaces corresponding to separated 
eigenvalues. (For a proof, we refer to [18], Corollary 3.) So, when K has 
eigenvalues that stay separated from the bulk of this matrix's eigenvalues, 
then M has in probability the same property, and the angle between the 
corresponding eigenspaces for K and M go to in probability. 
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(Note that the statements we just made assume that both M and K are 
symmetric, which is the case here.) 

The strategy for the proof is the following. According to the results of 
Lemma A. 3, the matrix X-Xj/p has "small" entries off the diagonal, whereas 
on the diagonal, the entries are essentially constant and equal to trace(S p ) /p. 
Hence, it is natural to try to use the 5-method (i.e., do a Taylor expansion) 
entry by entry. By contrast to standard problems in Statistics, the fact that 
we have to perform n 2 of those Taylor expansions means that the second 
order term is not negligible a priori. The proof shows that this approach can 
be carried out rigorously, and that, perhaps surprisingly, the second order 
term is not too complicated to approximate in operator norm. It is also 
shown that the third order term plays essentially no role. 

Before we start the proof, we want to mention that we will drop the 
index p in S p below to avoid cumbersome notation. Let us also note, more 
technically, that an important step of the proof is to show that, when the Yi 's 
have enough moments, they can be treated without much error in spectral 
results has bounded random variables — the bound depending on the number 
of moments, n and p. This then enables us to use concentration results 
for convex Lipschitz functions of independent bounded random variables at 
various important points of the proof and also in Lemma A. 3 whose results 
underly much of the approach taken here. 

Proof of Theorem 2.1. First, let us call 

A trace(S) 

r = . 

V 

Using Taylor expansions, we can rewrite our kernel matrix as 

f(XiXj/p) = /(0) + f'mixj/p + tp-^Xj/p) 2 

+ l^M {X >X J /pf if i , j. 

f(\\Xi\\l/p) = /(r) + /'(£ M )fMjM _ T \ on th e diagonal. 

The proof can be separated in different steps. We will break the kernel 
matrix into a diagonal term and an off-diagonal term. The results of Lemma 
A. 3, after they are shown, will allow us to take care of the diagonal matrix 
at relatively lost cost. So we postpone that part of the analysis to the end 
of the proof and we first focus on the off-diagonal matrix. 

In what follows, we call "second order term" the matrix A with entries, 
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We call "third order term" the matrix B with entries, 

r "f- j) ix' ; x yl ,f\ l:i . 

The "off-diagonal" matrix is the sum A + B. 

(A) Study of the off-diagonal matrix. 

• Truncation and centralization. Following the arguments of Lemma 2.2 
in [45] , we see that because we have assumed that we have 4 + e absolute 
moments, and lixp, the array Y = ^i<j<n,i<j<p is almost surely equal to 
the array Y of same dimensions with 

Y,j = Yij l\ Yitj | < Bp where B p = p 1/2 ~ s and 5 > 0. 

We will therefore carry out the analysis on this Y array. Note that most of 
the results we will rely on require vectors of i.i.d. entries with mean 0. Of 
course, Yij has in general a mean different from 0. In other words, if we call 
\i = E(iij), we need to show that we do not lose anything in operator norm 
by replacing Y^s by Ui's with Ui = Yi — \x\. Note that, as seen in Lemma A. 3, 
by plugging in t = 1/2 — 5 in the notation of this lemma, which corresponds 
to the 4 + e moment assumption here, we have 

M<p~ 3 / 2 - 5 . 

Now let us call S the matrix XX' /p, except that its diagonal is replaced 
by zeros. From [45], and the fact that n/p stays bounded, we know that 
|||XX'/p|||2 < oiC^OIII^^'llb/p stays bounded. Using Lemma A. 3, we see 
that the diagonal of XX' jp stays bounded a.s. in operator norm. Therefore, 
1 1| S 1 1| 2 is bounded a.s. 

Now, as in the proof of Lemma A. 3, we have 

Si,j = — - + A* 1 + +A» l + Ri,j a.s. 

p V p p J p p 

Note that this equality is true a.s. only because it involves replacing Y by 
Y. The proof of Lemma A. 3 shows that 

\Ri,j\ < /i2a 1 1/2 (E)(a 1 1/2 (S) +p~ S/2 ) +/iVi(S) a.s. 

We conclude that, for some constant C, 

Il-Rlll < Cn 2 n 2 < Cn 2 p- 3 - 2S -»• a.s. 

Therefore |||-R|||2 - > a.s. In other words, if we call Sjj the matrix with i,j 
entry U'jTXJj/p off the diagonal and on the diagonal, 

HIS 1 — Su\h — > a.s. 
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Now it is a standard result on Hadamard products (see for instance, [10], 
Problem 1.6.13, or [25], Theorems 5.5.1 and 5.5.15) that for two matrices A 
and B, \\\Ao B\\\2 < |||j4|||2|||-B|||2- Since the Hadamard product is commuta- 
tive, we have 

S o S — Sir o Su = (S + Sjj) (S — Su). 

We conclude that 

\\\SoS - Su o Su\\\ 2 < \\\S - Su\h(\\\S\h + \\\Su\h) a.s., 

since |||>f? — Su\\\2 — * a.s., and 1 1 1 *S* 1 1 1 2 and hence |||<S[/|||2 stay bounded, a.s. 

The conclusion of this study is that to approximate the second order term 
in operator norm, it is enough to work with Sjj and not S, and hence, very 
importantly, with bounded random variables with zero mean. Further, the 
proof of Lemma A. 3 makes clear that cr^, the variance of the Uij's goes 
to 1, the variance of the ijj's, very fast. So if we can approximate the 
matrix with (i,j)-entry U' i Y,Uj/(p(J u ) consistently in operator norm by a 
matrix whose operator norm is bounded, this same matrix will constitute 
an operator norm approximation of U'^Uj/p. 

In other words, we can assume that, when working with matrices of di- 
mension n x p, the random variables we will be working with have variance 
1 without loss of generality and that they have mean and are bounded by 
Bp, Bp depending on p and going to infinity. 

• Control of the second order term. We now focus on approximating in 
operator norm the matrix with (i,j')th entry, 

{^(X'Xj/pfl^. 

As we just explained, we assume from now on in all the work concerning 
the second order term that the vectors Yj, have mean 0, and that their entries 
have variance 1 and are bounded by B p = p l l 2 ~ & . This is because we just 
saw that replacing Yj by Ui/au would not change (a.s. and asymptotically) 
the operator norm of the matrix to be studied. We note that to make clear 

that the truncation depends on p, we might have wanted to use the notation 

(p) 

Y- , but since there will be no ambiguity in the proof, we chose to use the 
less cumbersome notation Y{. 

The control of the second order term turns out to be the most delicate 
part of the analysis, and the only place where we need the assumption that 
Xi = E 1 / 2 !;. Let us call W the matrix with entries 

, (^') 2 jf^, 
0, ifi = j. 
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Note that when i ^ j , 

E(Wij) = EitmceiXlX jX'jX^/p 2 = E^ce^-X^X,'))// 
= trace(£ 2 )/p 2 . 

Because we assume that trace(X)/p has a finite limit, and n/p stays bounded 
away from 0, we see that the matrix E(W) has a largest eigenvalue that, in 
general, does not go to 0. Note also that under our assumptions, E(Wij) = 
0(l/p). Our aim is to show that W can be approximated in operator norm 
by this constant matrix. So let us consider the matrix W with entries 

(o, if i = j. 

Simple computations show that the expected Frobenius norm squared of 
this matrix does not go to 0. Hence more subtle arguments are needed to 
control its operator norm. We will show that E(trace(W 4 )) goes to zero 
which implies that E(|||VF||||) goes to zero because W is real symmetric. 

The elements contributing to trace(VK 4 ) are generally of the form WijWj f. x 
WkjWij. We are going to study these terms according to how many indices 
are equal to each other. 

(i) Terms involving 4 different indices: i 7^ j 7^ k 7^ /. We first focus on 
the case where all these indices (i,j,k,l) are different. Recall that Xi = 
T}/ 2 Yi, where Y% has i.i.d. entries. We want to compute E(WjjW} k W k 1W1 i), 
so it is natural to focus first on 

E(% j ^,kW k ,iWi, i \Y i ,Y k ). 

Now, note that 

Wij = \{Y!ZYjYjEYi - trace(£ 2 )} 



Hence, calling 



^{YfriYjYj - Id)Sy 4 + trace^O^Y/ - Id))}. 



M 3 ±Y 3 Y>- Id, 



we have 

P 4 Wi,jWj,k = (Y?YlMjT,Y i Y k h EMjT,Yk) + (Y/EMjEF;) trace(S 2 M fc ) 

+ (yfcSM j Sy fc )trace(S 2 M i ) +trace(S 2 M i )trace(S 2 M A .). 
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Now, of course, we have E(Mj) = F,(Mj\Yi,Y k ) = 0. Hence, 

p^(W hj W hk \Y h Y k ) = {YlYK{M j YY i Y k h LM j \Y u Y k )YY k ) 

+ trace(S 2 Mi) trace(£ 2 M fe ). 

If M is a deterministic matrix, we have, since E(YjYj ) = Id, 

E(MjMMj) = E(YjYj MYjYj) - M. 

If we now use Lemma A.l, and, in particular, (4), page 41, we finally have, 
recalling that here a 2 = 1, 

E(MjMMj) = (M + M') + (/i 4 - 3) diag(M) + trace(M) Id —M 

= M' + (/x 4 - 3) diag(M) + trace (M) Id. 

In the case of interest here, we have M = El^Y^E, and the expectation is 
to be understood conditionally on Yi,Y k , but because we have assumed that 
the indices are different and the y m 's are independent, we can do the compu- 
tation of the conditional expectation as if M were deterministic. Therefore, 
we have 

(Y/EE(Mj ZYiY^Mj | Y t , Y k )YY k ) 

= y/spy fc y/s + ( M4 - 3) diag^^s) + (y fc '£ 2 y ) id]sy fc 
= [(y/s 2 y fc ) 2 + ( M4 - 3)y/sdiag(syy^s)sy fc + (y/s 2 y fc ) 2 ]. 

Naturally, we have 'E(Wi t jWj jk \Y i ,Y k ) = 'E(Wf. j iWi ti \Y i , Y k ), and therefore, by 
using properties of conditional expectation, since all the indices are different, 

= E([2(y/s 2 n) 2 + (M4 - 3)y/sdia g (syy fc / s)sn 

+ trace(S 2 M i )trace(S 2 M fc )] 2 ). 

By convexity, we have (a + b + c) 2 < 3(a 2 + 6 2 + c 2 ), so to control the above 
expression, we just need to control the square of each of the terms appearing 
in it. In other words, we need to understand the terms 

r 1 = E((y/£ 2 y fc ) 4 ), 

t 2 = E([y/sdia g (syy fc 's)sy fc ] 2 ) 

and 

T 3 = E([trace(S 2 MOtrace(S 2 M fc )] 2 ). 

Study ofT\. Let us start by the term T\ = E((y/£ 2 Yfc) 4 ). A simple re- 
writing shows that 

(y/S 2 y fc ) 4 = Y^YkY^YiY^YkY^Yi. 
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Using (4) in Lemma A.l, we therefore have, using the fact that £ 2 yy/£ 2 
is symmetric, 

E((y/£ 2 y fc ) 4 |y) 

= y/s 2 [2£ 2 yy/£ 2 + ( M4 - 3) diag(s 2 yy/£ 2 ) 

+ trace(£ 2 yy/£ 2 ) Id]£ 2 y 
= 3(Y/£ 4 y,) 2 + (/i 4 - 3)Y/£ 2 diag(£ 2 y y/£ 2 )£ 2 y . 
Finally, we have, using (5) in Lemma A.l, 

E((F/£ 2 Y fc ) 4 ) = 3[2trace(£ 4 ) + (trace(S 4 )) 2 + (/i 4 - 3)trace(S 4 o S 4 )] 

+ (/x 4 - 3)E(y/s 2 dia g (s 2 y i y/s 2 )s 2 y i ). 

Now we have 

y/s 2 dia g (s 2 y y/s 2 )s 2 y = trace(s 2 yy/s 2 diag(s 2 y y/s 2 )) 

= trace(s 2 yy/s 2 o s 2 yy/£ 2 ). 

Calling V{ = S 2 y, we note that the matrix whose trace is taken is (viv'j) o 
{viv'j) = (vi o o vi) 1 (see [24], page 458 or [25], page 307). Hence, 

y/s 2 dia g (s 2 y y/s 2 )s 2 y = ||^o^|| 2 . 

Now let us call m k the feth column of the matrix S 2 . Using the fact that 
S 2 is symmetric, we see that the fcth entry of the vector Vi is v i(k) = m' k Yi. 
So Vi(k) = Ylmkm' k YiYlmkrn' k Yi. Calling M k = ^k^'ki we see > using (5) in 
Lemma A.l, that 

E(t>i(/c) 4 ) = 2 trace(A4 2 ) + [trace(.Mfc)] 2 + (a*4 - 3) trace(.Mfc o M k ). 

Using the definition of M k , we finally get that 

~E(vi(k) A ) = 3||mjfe||2 + (a«4 - 3)||m fc om k \\\. 

Now, note that if C is a generic matrix and C k is its Arth column, denoting 
by e k the kth vector of the canonical basis, we have C k = Ce k , and hence 
l|Cfcll| = e ' k C'Ce k ^ where 0i(C) is the largest singular value of C. 

So in particular, if we call X\(D) the largest eigenvalue of a positive semi- 
definite matrix D, we have < Ai(S 4 ) ||t77,^ |||. 

After recalling the definition of m k , and using the fact that \\m k o 
mjfclH = ||S 2 o S 2 |||,, we deduce that 

E (IK ° Villi) =3y^||mfc||| + (/i 4 - 3) ^2 W m k m fclli 

< 3Ai(S 4 ) trace(S 4 ) + Qu 4 - 3) trace([S 2 o S 2 ] 2 ). 
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Therefore, we can conclude that 

E((y/£ 2 Y fc ) 4 ) < 3Ai(S 4 ) trace(S 4 ) + (/x 4 - 3) trace([£ 2 o £ 2 ] 2 ). 

Now recall that, according to Theorem 5.5.19 in [25], if C and D are positive 
semidefinite matrices, A(C o D) -< w d(C) o X(D) where A(-D) is the vector 
of decreasingly ordered eigenvalues of D, and d{C) denotes the vector of 
decreasingly ordered diagonal entries of C (because all the matrices are 
positive semidefinite, their eigenvalues are their singular values). Here -< w 
denotes weak (sub)majorization. In our case, of course, C ' = D = S 2 . Using 
the results of Example II.3.5(iii) in [10], with the function (p(x) = x 2 , we see 
that 

trace((S 2 o S 2 ) 2 ) = A 2 (S 2 o S 2 ) < df(S 2 )A 2 (S 2 ) < Ai(S 4 ) trace(S 4 ). 

Finally, we have 

(1) T-y = E((K/S 2 n) 4 ) < (3 + |/x 4 - 3|)A!(S 4 ) trace(S 4 ). 

This bounds the first term, T±, in our upper bound. 

Study 0/T3 . Let us now turn to the third term, T3 = E([trace(£ 2 Mj) trace x 
(S 2 Mfc)] 2 ). We remind the reader that Mj = YjY! — Id. By independence of 
Yi and Y k: it is enough to understand E([trace(£ 2 Aij)] 2 ). Note that 

E([trace(S 2 Mi)] 2 ) = E([Y/£ 2 Y - trace(S 2 )] 2 ) 

= E(y/S 2 y i y/S 2 y i ) - trace(£ 2 ) 2 . 

Using (5) in Lemma A.l, we conclude that 

E([trace(S 2 Mi)] 2 ) = 2trace(S 4 ) + (> 4 - 3)trace(S 2 o S 2 ). 

Using the fact that we know the diagonal of S 2 o S 2 , we conclude that 

T 3 = E([trace(S 2 Mi)] 2 [trace(S 2 M fc )] 2 ) 

(2) 

< {2trace(£ 4 ) + |/x 4 - 3|Ai(I! 2 ) trace(S 2 )} 2 . 
So we have an upper bound on T3. 

Study 0/T2 . Finally, let us turn to the middle term, T2 = E([Y/E diag(Sy Y^ x 
E)EYfe] 2 ). Before we square it, the argument of the expectation has the form 
y/£diag(£YfeY/£)£Yfe. Call u k = TY k . Making the same computations as 
above, we find that 

r/£diag(£Y fe y/£)£y fe = trace(diag(£y fe y/£)£y fe y/£) 

= trace((£Y fe Y/E) o {TY k Yfr)) 

= trace((ufcn^) o (n^u-)) = trace((ufc o Uf~)(ui o Ui)') 

= {UiOUi)'(u k OU k ). 
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We deduce, using independence and elementary properties of inner products, 
that 

E([y/Sdiag(Sy fc y/S)SY fc ] 2 ) < E(||m o Ui|||)E(|K o u k \\ 2 2 ). 

Note that to arrive at (1), we studied expressions similar to E(||ttj o tti|||)- 
So we can similarly conclude that 

T 2 = E([y/Sdiag(Sy fc y/S)Sy fc ] 2 ) < {(3 + \ H - 3|)A!(£ 2 ) trace(£ 2 )} 2 . 

(3) 

With our assumptions, the terms (1), (2) and (3) are 0(p 2 ). Note that in 
the computation of the trace, there are 0(n 4 ) such terms. Finally, note 
that the expectation of interest to us corresponds to the sum of the three 
quadratic terms divided by p s . So the total contribution of these terms is 
in expectation 0(p~ ). This takes care of the contribution of the terms 
involving four different indices, as it shows that 




(ii) Terms involving three different indices: i^ j ^ k. Note that because 
Wi s i = 0, terms involving 3 different indices with a nonzero contribution are 
necessarily of the form (W^j) (Wjfc) , since terms with a cycle of length 3 

all involve a term of the form Wi t i and hence contribute 0. Let us now fo- 
cus on those terms, assuming that j 7^ k. Note that we have 0(n 3 ) such 
terms and that it is enough to focus on the WfjWf k , since the contri- 
bution of the other terms is, in expectation, of order 1/p 4 [with our as- 
sumptions trace(£ 2 )/p 2 = 0(1 /p)], and because we have only n 3 terms in 
the sum, this extra contribution is asymptotically zero. Now, we clearly 
have ~Ei(WfjWf k \Yi) = [E(Wjj|li)] 2 , by conditional independence of the two 
terms. The computation of E(W i 2 J |y) is similar to the ones we have made 
above, and we have 

P A E(w^\Yi) = 2(y/s 2 y) 2 + (p A - 3)y/s dia g (sy y/s)sy 

+ (trace(Syy/S)) 2 . 

Using the fact that /Q = Syy/S is positive semidefinite, and hence its di- 
agonal entries are nonnegative, we have trace(/Q o JC{) < (trace(ZCj)) 2 , and 
we conclude that 

p 4 E(W^\Yi) < (3 + |k 4 - 3|)(y/£ 2 y) 2 < (3 + \k a - 3|)ai(S) 4 ||y \\ A 2 . 

Hence, 

V(W? tj W* k ) < 1(3 + |«4 - 3|)V(S) 8 ||y in. 
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Now, the application F which takes a vector and returns its Euclidean norm 
is trivially a convex 1-Lipschitz function, with respect to Euclidean norm. 
Because the entries of Y{ are bounded by B p , we see that, according to 
Corollary 4.10 in [29], F{Yf) = \\Yi\\2 satisfies a concentration inequality, 
namely, for r > 0, P(|||li||2 — m F \ > r) < 4exp(— r 2 /16B 2 ) where m F is a 
median of F{Yf) = \\Yi\\2 (hence m F is a deterministic quantity). A simple 
integration (see, for instance, the proof of Proposition 1.9 in [29], and change 
the power from 2 to 8) then shows that 

E(\\\Y l \\ 2 -m F \ 8 ) = 0(B 8 ). 

Now we know, according to Proposition 1.9 in [29], that if [i F is the mean 
of F(Yi), that is, fip = E(||Yj||2), fJ, F exists and \mp — fj, F \ = 0(B P ). Since 
fJ-p < A*f 2 = E(ll^illi) = P-> we conclude that, if C denotes a generic constant 
that may change from display to display, 

E(]|^||l) < EdUYlla - m F + m F \ 8 ) < 2 7 (E(| ||1-|| 2 - m F \ 8 ) + m|) 

< C(E(| H^lla - m F \ 8 ) + \m F - fi F \ 8 + fi%) < C(B 8 +p A ). 

Now our original assumption about the number of absolute moments of the 
random variables of interest imply that B p = 0(p l l 2 ~ & ). Consequently, 

E(||K t ||l) = 0(p 4 ). 

Therefore, 

E(W; 2 Xfc) = °(^" 4 ) 

and 

Hence, we also have 

E E E(^.^ 2 fc ) = o( P - 1 ). 

(hi) Terms involving two different indices: i^ j. The last terms we have 
to focus on to control E(trace(VK 4 )) are of the form W^ -. Note that we have 
n 2 terms like this. Since by convexity, (a + 6) 4 < 8(a 4 + b 4 ), we see that it 
is enough to understand the contribution of Wfj to show that J2i j ^(W 4 ^) 
tends to zero. Now, let us call for a moment v = TYi and u = Yj. The quantity 
of interest to us is basically of the form E,((u'v) 8 ). Let us do computations 
conditional on v. We note that since the entries of u are independent and 
have mean 0, in the expansion of (u'v) 8 , the only terms that will contribute 
a nonzero quantity to the expectation have entries of u raised to a power 
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greater than 2. We can decompose the sum representing ~E((u'v) & \v) into 
subterms, according to what powers of the terms are involved. There are 6 
terms: (2,2,2,2) (i.e., all terms are raised to the power 2), (3,3,2) (i.e., two 
terms are raised to the power 3, and one to the power 2), (4,2,2), (4,4), 
(5,3), (6,2) and (8). For instance the subterm corresponding to (2,2,2,2) 
is, before taking expectations, 

After taking expectations conditional on v, we see that it is obviously non- 
negative and contributes 

(^ 2 ) 4 E k<w*j 2 < (E^ 2 ) 4 = W 532 *) 4 < ^fm\i 

Note that we just saw that E(||Yj|||) = 0(p 4 ) in our context. Similarly, the 
term (3,3,2) will contribute 

& 2 E v l v l v l- 

h 7^2 7^3 

In absolute value, this term is less than 

^ 2 (En 3 ) 2 (E4 

Now, note that if z is such that \\z\\2 = 1, we have, for p > 2, ^ \zi\ p <^zf = 
1. Applied to z = H2, we conclude that Yl \ vi\ p < \\v\\?,- Consequently, the 
term (3,3,2) contributes in absolute value less than 

2 2 1 1 1 1 X 

A*3 ff 1Mb- 

The same analysis can be repeated for all the other terms which are all found 
to be less than H^Hf times the moments of u involved. Because we have 
assumed that our original random variables had 4 + £ absolute moments, 
the moments of order less than 4 cause no problem. The moments of order 
higher than 4, say 4 + k, can be bounded by n^Bp. Consequently, we see 
that 

E(W&) = E(E«.|y,)) < CBfif^Pj = 0(B> 4 ) = 0(p-( 2 + 45 )). 

Since we have n 2 such terms, we see that 

E E (Wij)^0 asp ^00. 

Using our earlier convexity remark, we finally conclude that 
EE(Wj)^0 asp^oo. 
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(iv) Second order term: combining all the elements. We have therefore 
established control of the second order term and seen that the largest sin- 
gular value of W goes to in probability, using Chebyshev's inequality. 
Note that we have also shown that the operator norm of W is bounded in 
probability and that 



w _ traced) 



—7-0 in probability. 

2 



• Control of the third order term. We note that the third order term is 
of the form j ) — ^ Wi > j . According to Lemma A. 5, if M is a real 

symmetric matrix with nonnegative entries, and E is a symmetric matrix 
such that maxjj|.Ejj| = then 

eri(£oM) < Qai(M). 

Note that W is real symmetric matrix with nonnegative entries. So all 
we have to show to prove that the third order term goes to zero in operator 
norm is that maxj^j|X-X,/p| goes to because we have just established 
that |||W||| 2 remains bounded in probability. We are going to make use of 
Lemma A. 3, page 45 in the Appendix. In our setting, we have B p = p 1 / 2 ~ s , 
or 2/m = 1/2 — 5. The lemma implies, for instance, that 

m&xlX'iXj /p\ < p~ s log(p) a.s. 

So maxj^j | XpTj jp\ — > a.s. Note that this implies that maxj^j|^j| — > a.s. 
Since we have assumed that /( 3 ) exists and is continuous and hence bounded 
in a neighborhood of 0, we conclude that 

max|/( 3 )(e tJ )^^/p| = o(jT 5 / 2 ) a.s. 

If we call E the matrix with entry Eij = /^(^i^X-Xj/p off-the diagonal 
and on the diagonal, we see that E satisfies the conditions put forth in 
our discussion earlier in this section and we conclude that 

HIS o W||| 2 < max|£i A \\\W\\\ 2 = o(p~ 5/2 ) a.s. 

Hence, the operator norm of the third order term goes to almost surely. 
[To maybe clarify our arguments, let us repeat that we analyzed the second 
order term by replacing the 1^'s by, in the notation of the truncation and 
centralization discussion, Ui. Let us call Wu = Su oSjj, again using notation 
introduced in the truncation and centralization discussion. As we saw, ||| W — 
Wl/llk — > a.s., so showing, as we did, that |||W[/|||2 remains bounded (a.s.) 
implies that |||W|||2 does too, and this is the only thing we need in our 
argument showing the control of the third order term.] 
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(B) Control of the diagonal term. The proof here is divided into two 
parts. First, we show that the error term coming from the first order ex- 
pansion of the diagonal is easily controlled. Then we show that the terms 
added when replacing the off-diagonal matrix by XX' /p + trace(£ 2 )/p 2 ll' 
can also be controlled. Recall the notation r = trace(S)/p. 

• Errors induced by diagonal approximation. Note that Lemma A. 3 guar- 
antees that for all i, |£j j — r| < p~ s / 2 , a.s. Because we have assumed that /' 
is continuous and hence bounded in a neighborhood of r, we conclude that 
f'(£i,i) is uniformly bounded in p. Now Lemma A. 3 also guarantees that 



Hence, the diagonal matrix with entries /(H-XiHl/p) can be approximated 
consistently in operator norm by /(r) Id a.s. 

• Errors induced by off- diagonal approximation. When we replace the off- 
diagonal matrix by f'(0)XX'/p + [/(0) + /"(0) trace(£ 2 )/2p 2 ]ll', we add a 
diagonal matrix with entry /(0) + f'(0)\\Xi\\l/p + f"(0) trace(S 2 )/2p 2 
which we need to subtract eventually. We note that < trace(X 2 )/p 2 < 
cr 2 (S) jp — > when 01 (X) remains bounded in p. So this term does not create 
any problem. Now, we just saw that the diagonal matrix with entries H-XiHl/p 
can be consistently approximated in operator norm by (trace(X)/p) Id. So 
the diagonal matrix with (i, i) entry f(0) + f'{0)\\Xi\\l/p+ f"{0) trace(S 2 )/2p 2 
can be approximated consistently in operator norm by (/(0) + /'(0) trace x 
(S)/p)Id a.s. 

This finishes the proof. □ 

2.3. Kernel random matrices of the type f(\\Xi — XjW^/p). As is to be 
expected, the properties of such matrices can be deduced from the study of 
inner product kernel matrices, with a little bit of extra work. We need to 
slightly modify the distributional assumptions under which we work, and 
consider the case where we have 5 + e absolute moments for the entries of 
Yi . We also need to assume that / is regular is the neighborhood of different 
points. Otherwise, the assumptions are the same as that of Theorem 2.1. 
We have the following theorem: 

Theorem 2.2 (Spectrum of Euclidean distance kernel matrices). Con- 
sider the n x n kernel matrix M with entries 



max 




t <p 



a.s. 



P 
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Let us call 



T = 2 



trace (S) 
P 



Let us call ip the vector with ith entry ipi = H-^QlH/p - trace(X)/p. Suppose 
that the assumptions of Theorem 2.1 hold, but that conditions (e) and (f) 
are replaced by: 

(e') The entries ofYi, a p- dimensional random vector, are i.i.d. Also, de- 
noting by Yi(k) the kth entry of Yi, we assume that E(Yj(fc)) = 0, 
var(Yj(fc)) = 1 and E(\Yi(k)\ 5+e ) < oo for some e > 0. (We say that 
Yi has 5 + e absolute moments.) 

(f) / is C 3 in a neighborhood of t. 

Then M can be approximated consistently in operator norm ( and in prob- 
ability) by the matrix K , defined by 

. , XX'' 

lip ' + ipl ' -2 



K = f(r)ll' + f'(T 
f"(r) 



+ 



1(V> o r/,)' + o V)l' + + 4 trace ( £2 ) n t 



P 



+ v p Id, 



v P = f(0)+rf'(T)-f(r). 
Ln other words, 

\\\M — K\\\2 — > in probability. 

Proof. Note that here the diagonal is just /(0)ld and it will cause no 
trouble. The work, therefore, focuses on the off-diagonal matrix. In what 
follows, we call r = 2 tmc ^ S ^ . Let us define 



.4 



'■j 



I Y 1 1 2 II v 1 1 2 
KMllj _|_ II A?ll2 _ r 



and 



P 



'i,3 



P 



p 



With these notation, we have, off the diagonal, that is, when i^ j, by a 
Taylor expansion, 



M, 



- f(r) + [Aij - 2Sij]f'(T) + \[A^ - 2S M f/"(r) 
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We note that the matrix A with entries Aij is a rank 2 matrix. As a matter 
of fact, it can be written, if is the vector with entries ipi = - — — — r/2, 
A = lip' + ipV '■ Using the well-known identity (see, e.g., [23], Chapter 1, 
Theorem 3.2), 

det(I + uv' + vu') = det( 1 + i V ) ; 

V \\ V \\2 1 + WV J 

we see immediately that the nonzero eigenvalues of A are 

ity±V^IMl2. 

After these preliminary remarks, we are ready to start the proof per se. 

• Truncation and centralization. Since we assume 5 + e absolute mo- 
ments, we see, using Lemma 2.2 in [45], that we can truncate the Y^'s at 
level Bp = p 2 / 5_<5 with 8 > and a.s. not change the data matrix. We then 
need to centralize the vectors truncated at p 2 / 5-5 . Note that because we work 
with Xi — Xj = S 1 / 2 (li — Yj), centralization creates absolutely no problem 
here since it is absorbed in the difference. So in what follows we can assume 
without loss of generality that we are working with vectors Xi = Y^I 2 Yi 
where the entries of Y\ are bounded by p 2 / 5 ' 5 and E(li) = 0. The issue of 
variance 1 is addressed as before, so we can assume that the entries of Y{ 
have variance 1. 

• Concentration of ||Aj — XjW^/p. By plugging in the results of Corol- 
lary A. 2, with 2/m = 2/5 — 5, we get that 

\ x i- x j\\l 2 trace(E) 



max 



p p 

Also, using the result of Lemma A. 3, we have 



< log(p)p" 



-1/10-5 



max | | = max 



Xi\\r, trace(S) 



< \og(p)p 



-1/10-5 



p p 

Note that, as explained in the proof of Lemma A. 3, these results are true 
whether we work with Y{ or their truncated and centralized version. 

• Control of the second order term. The second order term is the matrix 
with (i,j)-entry 

l^js/ ( r )(^-*ii — ■ 

Let us call T the matrix with on the diagonal and off-diagonal entries 
Tij = (Aij — 2Sij) 2 . In other words, 
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We simply write (Aij - 2Sij) 2 = A? • - AAijSij + 45!^-. In the notation 
of the proof of Theorem 2.1, the matrix with entries Sfj off the diagonal 
and on the diagonal is what we called W. We have already shown that 



w _ traced) 
pZ 



— ^ in probability. 

2 



Now, let us focus on the term AijSij. Let us call H the matrix with 

= (1 — $i,j)AijSij. 

Let us denote by S the matrix with off-diagonal entries S%j and on the 
diagonal. If we call S = XX' /p, we have 

S = S-diag(S). 

Now note that Aij =ipi + ipj. Therefore, we have, if diag(V>) is the diagonal 
matrix with entry ipi, 

H = Sdiag(ip) + diag(ip)S. 

We just saw that under our assumptions, maxj|^j| — > a.s. Because for any 
nxn matrices L±, L2, ||| Z/1Z/2 IH2 < |||-^i|||2|||-^2|||2, we see that to show that 
|||-ff|||2 goes to 0, we just need to show that |||5|||2 remains bounded. 

Now we clearly have, HIS*!!^ < lll^llklll Y'Y/pH^. We know from [45] that 
|||Y' Y/p|||2 — > c 2 (l + \/n/p) 2 i a.s. Under our assumptions on n and p, this 
is bounded. Now 

diag(S) = diag(V0 + faaCe(S) Id, 

p 

so our concentration results once again imply that ||| diag(S") 1 1| 2 < trace(S) /p+ 
rj a.s., for any rj > 0. Because ||| • 1 1| 2 is subadditive, we finally conclude that 

\\\S\\\2 is bounded a.s. 

Therefore, 

HI-HIII2 -> a.s. 

Putting together all these results, we see that we have shown that 

2^ 

in probability. 



T — (AoA — diag(A o A)) - 4 trace ( E > r u > _ Id) 

pZ 



• Control of the third order term. The third order term is the matrix L 
with on the diagonal and off-diagonal entries 

r = / (3) te,j) ( \\Xj - Xjg - 2trace(S) \ 3 ^ 
M 6 ^ p J 
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where T was the matrix investigated in the control of the second order term. 
On the other hand, E is the matrix with entries 

E - = (l-S- / (3) ^) ^ ll^-^lll-2trace(S) 
1,1 y h3) 6 V V 
We have already seen that through concentration, we have 
\Xi-Xj\\l 2trace(£) 



max 

irt P V 

This naturally implies that 

2trace(S) 



max 



fa p 



< \og{p)p- 1/w - & a.s. 



< log{p)p- 1/w - 5 a.s 



So if /( 3 ) is bounded in a neighborhood of r, we see that with high-probability 

so is maxj^j 

|/ (3) (fij)l- Therefore, 



max|£/jj| < Klog(p)p 



-1/10-5 



We are now in position to apply the Hadamard product argument (see 
Lemma A. 5) we used for the control of the third order term in the proof of 
Theorem 2.1. To show that the third order term tends in operator norm to 
0, we hence just need to show that |||T|||2 remains small compared to the 
bound we just gave on maxij\Eij\. Of course, this is equivalent to showing 
that the matrix that approximates T has the same property in operator 
norm. 

Clearly, because 0i(£) stays bounded, trace(S 2 )/p stays bounded and so 
does |||trace(S 2 )/p 2 (ll / — Id)|||2. So we just have to focus on Ao A — diag(^4o 
A). Recall that A^^ = 2(|| Ajll 2 ,/^ — trace(S)/p), and so A^i = 2ipi. We have 
already seen that our concentration arguments imply that maxj|^>j| — > a.s. 
So |||diag(yl o ^4) ||| 2 = maxj^ 2 goes to a.s. Now, 

A = lift + if)!', 

and hence, elementary Hadamard product computations [relying on ab' o 
uv' = (a o u) (b o v)'] give 

A o A = 1(V> o ip)' + 2W + (if) o if>)l'. 

Therefore, 

IP o A||| 2 < 2(Vn||^ o ^|| 2 + l^lll). 
Using Lemma A.l, and in particular equation (5), we see that 
„, , 9 , A trace(E 2 ) , 4 .trace(SoS) 

pZ pZ 
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and therefore, E(||-i/;|||) remains bounded. On the other hand, using Lemma 
2.7 of [5], we see that if we have 5 + e absolute moments, 

E(tf) < c( ^ 4tra p C f 2))2 

Now recall that we can take B p =p 2//5_<5 . Therefore nEdl^o^Hl) is, at most, 
of order Bp~ 6 /p. We conclude that 



P(|||AoA||| 2 >Iog(p)VB|-7p)-^0. 
Note that this implies that 



P(|||T||| 2 >log(p)/B p 3 -7p)^0. 

Now, note that the third order term is of the form EoT. Because we have 
assumed that we have 5 + s absolute moments, we have already seen that 
our concentration results imply that 




max|£ p- 1/10 ~ 5 ) 

Using the fact that T has positive entries and therefore (see Lemma A. 5), 
l-So^llb < maxjj \Eij\ |||r|||2, we conclude that with high-probability, 



|£oT||| 2 = 0( (log(p)) 2 W^V I =0((log(p))V 5 ') where 5' > 0. 



P 2 



Hence, 

\\\E o T\\\2 — > in probability. 



• Adjustment of the diagonal. To obtain the compact form of the approx- 
imation announced in the theorem, we need to include diagonal terms that 
are not present in the matrices resulting from the Taylor expansion. Here, 
we show that the corresponding matrices are easily controlled in operator 
norm. 

When we replace the zeroth and first order terms by 

XX 1 ' 



/(t)11' + /'(t) 



lip' + tpl' - 2- 



V 



we add to the diagonal the term /(r) + f , (r)(2ipi 

. trace(S) 
P ' 



2pQ|||/p) = /(r) 



2/'(r) 3 



In the end, we need to subtract it. 
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When we replace the second order term by |/ // (r)[l( , °ip)' + 2t/)tp' + (ipo 

ip)l' + 4 trac °a S - 11/], we add to the diagonal the diagonal matrix with 
entry, 



2/"(r) 



, 9 tracefS 
A + 



2M 



^2 



With our assumptions, maxi^l — > a.s. and trac °( s ) remains bounded, 
so the added diagonal matrix has operator norm converging to a.s. We 
conclude that we do not need to add it to the correction in the diagonal of 
the matrix approximating our kernel matrix. □ 

An interpretation of the proofs of Theorems 2.1 and 2.2 is that they rely on 
a local "multiscale" approximation of the original matrix (i.e., the terms used 
in the entry- wise approximation are all of different order of magnitudes, or at 
different "scales"). However, globally, that is, when looking at eigenvalues of 
the matrices and not just at each of their entries there is a bit of a mixture 
between the scales which creates the difficulties we had to deal with to 
control the second order term. 



2.3.1. A note on the Gaussian Kernel. The Gaussian kernel corresponds 
to f(x) = exp(--yx) in the notation of Theorem 2.2. We would like to discuss 
it a bit more because of its widespread use in applications. 

The result of Theorem 2.2 gives accurate limiting eigenvalue information 
for the case where we renormalize the distances by the dimension which 
seems to be implicitly or explicitly what is often done in practice. 

However, it is possible that information about the nonrenormalized case 
might also be of interest in some situations. Let us assume now that trace(S) 
grows to infinity at least as fast as p 1 / 2 + 2 / m + 5 where 5 > is such that 
1/2 + 2/m + 5 < 1 which is possible since m > 5 + e here. We of course 
still assume that its largest singular value, o"i(S), remains bounded. Then 
Corollary A. 2 guarantees that 

\\Xi-XA\l trace(S) 
mm — > a.s. 

i+j p p 

Hence 

maxexp(— \\Xi — Xj\\ 2 ) < exp(— trace(E)) < 

exp(V /2+2/m+<5 ) a.s. 

Hence, in this case, if M is our kernel matrix with entries exp(— \\Xi — 
Xj Hi), we have 

|||M - Id < nexp(-p 1/2+2/m+s ) a.s., 
and the upper bound tends to zero extremely fast. 
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2.4. More general models. In this subsection, we consider more general 
models than the ones considered above. In particular, we will here focus 
on data models for which the vectors Xi satisfy a so-called dimension-free 
concentration inequality. As was shown in [19], under these conditions, the 
Marcenko-Pastur equation holds (as well as generalized versions of it). Note 
that these models are more general than the one considered above (the 
proofs in the Appendix illustrate why the standard random matrix models 
can be considered as subcases of this more general class of matrices) and 
can describe various interesting objects like vectors with certain log-concave 
distributions or vectors sampled in a uniform manner from certain Rieman- 
nian submanifolds of W endowed with the canonical Riemannian metric 
inherited from W . 

Before we state more precisely the theorem and give examples of distribu- 
tions that satisfy its assumptions, let us give some motivation. One potential 
criticism of Theorems 2.1 and 2.2 is that they deal with data models that are 
inherently quite linear. So a natural question is to understand whether our 
linear approximation result is limited to these linear settings. Also, kernel 
methods are often advocated for their handling of nonlinearities, and though 
the linear case is probably a basic one that needs to be understood, a null 
model of sorts, it is important to be able to go beyond it. As we will soon 
see, the next theorems allow us to get results beyond the linear setting. 

Our generalization of Theorem 2.1 to more general distributions is the 
following. 

Theorem 2.3. Consider a triangular "array" of matrices where each 
row of the array consists of a n x p matrix. We assume these matrices are 
independent. We call Xi, i = 1, . . . , n the rows of this matrix. 

Suppose the vectors {Aj}™ =1 £MP are i.i.d. mean and have the property 
that for any 1-Lipschitz function F (with respect to Euclidean norm), if mp 
is a median of FiXj), 

\ft>0 P(\F(Xi) -m F \>t) <Cexp(-ct 6 ) for some b>0, 

where C is independent of p and c may depend on p but is required to satisfy 
c > p~i 1 / 2 ~ £ ) b / 2 _ is fixed and independent of n and p. 

Consider the n x n kernel random matrix M with Mi j = f(X' i Xj/p). 
Assume that p^n. 

(a) Call £ the covariance matrix of the Xi's and assume that ci(£) stays 
bounded and trace(S)/p has a limit. 

(b) Suppose that f is a real valued function which is C 2 around and C l 
around trace(£)/p. 
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Then the spectrum of this matrix is asymptotically nonrandom and has, a.s., 
the same limiting spectral distribution as that of 

V VI 

M = /(0)ll' + /'(0) +u p Id n> 

p 

where v p = f£*f&) - /(0) - /' (0)^1. 

We note that the term /(O)ll' does not affect the limiting spectral distri- 
bution of M since finite rank perturbations do not have any effect on limiting 
spectral distributions (see, e.g., [3], Lemma 2.2). Therefore, it could be re- 
moved from the approximating matrix, but since it will clearly be present in 
numerical work and simulations, we chose to leave it in our approximation. 
We also note that the limiting distribution of XX' /p under these assump- 
tions has been obtained in [19]. 

Here are a few examples of models satisfying the distributional assump- 
tions stated above. (Unless otherwise noted, b = 2 in all these examples.) 

Examples of distributions for which the previous theorem applies. 

• Gaussian random variables with |||S|||2 bounded and trace(S)/p converges. 
The assumptions of the theorem apply according to [29], Theorem 2.7, 
with c(p) = l/|||S||| 2 . 

• Vectors of the type y/pr where r is uniformly distributed on the unit 
(£2-) sphere is dimension p. Theorem 2.3 in [29] shows that Theorem 2.3 
applies, with c(p) = (1 — l/p)/2, after noticing that a 1-Lipschitz function 
with respect to Euclidean norm is also 1-Lipschitz with respect to the 
geodesic distance on the sphere. 

• Vectors T^fpr with r uniformly distributed on the unit (^2-)sphere in W 
and with IT' = S where £ satisfies the assumptions of the theorem. 

• Vectors with log-concave density of the type e~ u<yX ^ with the Hessian of 
U satisfying, for all x, Hess (£7) > cld p where c > has the characteristics 
of c(p) above (see [29], Theorem 2.7). Here we also need |||£|||2 to satisfy 
the assumptions of the theorem. 

• Vectors r distributed according to a (centered) Gaussian copula, with cor- 
responding correlation matrix £ having |||£||| 2 bounded. Here Theorem 2.3 
applies, since if f has a Gaussian copula distribution, then its ith. entry 
satisfies fj = where v is multivariate normal with covariance matrix 
S, £ being a correlation matrix, that is, its diagonal is 1. Here $ is the cu- 
mulative distribution function of a standard normal distribution which is 
trivially Lipschitz. Now taking r = f — 1/2 gives a centered Gaussian cop- 
ula. The fact that the covariance matrix of r then has bounded operator 
norm requires a bit of work and is shown in the Appendix of [19]. 
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• Vectors sampled uniformly from certain compact connected smooth Rie- 
mannian submanifolds, M, of W p , canonically equipped with the Rieman- 
nian metric g defined by restricting to each tangent space the ambient 
scalar product in W. The curvature properties of these submanifolds need 
to satisfy the assumptions of Theorem 2.4 in [29]. Also, the covariance ma- 
trix £ need to satisfy the assumptions of Theorem 2.3. We note that since 
the length of a curve in M is equal to its length in MP, the same remark 
that we made in the case of the sphere of unit radius applies here, too. 
In particular, a 1-Lipschitz function with respect to Euclidean norm is 
1-Lipschitz with respect to the geodesic distance on the manifold. 

• Vectors of the type p x l h r, 1 < b < 2 where r is uniformly distributed in 
the l-l b ball or sphere in MP. (See [29], Theorem 4.21, which refers to 
[34] as the source of the theorem.) We also refer the reader to [29], pages 
37 and 38 for some of subtleties involved in the definition of the uniform 
distribution on the sphere. Fact A.l applies to them, with c(p) depending 
only on b. Also, the concentration function is of the form exp(— c(b)t b ) 
here. 

We now turn to the proof of Theorem 2.3. The first step in the proof is 
the following lemma. 

Lemma 2.1. Suppose K n is annxn real symmetric matrix, whose spec- 
tral distribution converges weakly (i.e., in distribution) to a limit. Suppose 
M n is an n x n real symmetric matrix. 



1. Suppose M n is such that \\M n — K n \\p = o( v / n). Then M n and K n have 
the same limiting spectral distribution. 

2. Suppose M n is such that |||M n — -Knllh — > 0. Then M n and K n have the 
same limiting spectral distribution. 

Before we prove the lemma, we note that our assumptions imply that the 
limiting spectral distribution of K n is a probability distribution. Therefore, 
to obtain the results of the lemma, we just need to show pointwise con- 
vergence of Stieltjes transforms and then rely on the results of [22], and in 
particular Corollary 1 there. 

Proof of Lemma 2.1. We call St^ n and StM n the Stieltjes transforms 
of the spectral distributions of these two matrices. Suppose z = u + iv. Let 
us call li(M n ) the ith largest eigenvalue of M n . 

Proof of statement 1. We first focus on the Frobenius norm part of the 
lemma. We have 



|Stjr»(*)-Stjif tt (z)| = - 
n 



i 1 

v — 

^li(K n )-z h{M n ) 



< iA |ij(KHi(^ 
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Now, by Holder's inequality, 

n 

J2H M n)-h(K n )\<V^, 



i=l 



Y,\k{M n )-k(K n ) 
\ i=i 



Using Lidskii's theorem [i.e., the fact that, since M n and K n are hermitian, 
the vector with entries li(M n ) — li(K n ) is majorized by the vector li{M n — 
K n )], with, in the notation of [10], Theorem III. 4. 4 <3>(x) = x 2 , we have 

n n 

\k(M n ) - k(K n )\ 2 < ^ 2 (M n - K n ) = \\M n - K n \\ 2 F . 

i=l i=l 

We conclude that 

'\M n -F n \\ F 



\St Kn (z)-St Mn (z)\< 



since \li{K n ) — z\ > \lm[li(K n ) — z]\ =v, and therefore l/\li(K n ) — z\ < 1/v. 
Under the assumptions of the lemma, we therefore have 

\St Kn (z)-St Mn (z)\^0. 

Therefore the Stieltjes transform of the spectral distribution of M n converges 
pointwise to the Stieltjes transform of the limiting spectral distribution of 
K n . Hence, by, e.g., Corollary 1 in [22], the spectral distribution of M n 
converges in distribution to the limiting spectral distribution of K n , which, 
as noted earlier, is a probability distribution by our assumptions. 

Proof of statement 2. Let us now turn to the operator norm part of 
the lemma. By the same computations as above, we have, using Weyl's 
inequality, 



\St Kn (z) - St Mn (z) 



< 



E 



1 



1 



E 



^k{K n )-z k(M n )-z 
k{M n )-k{K n )\ 



< 



i=l 

\M n -K n \\\ 2 



Hence if |||M n , — -ftTnllh - > 0, it is clear that the two Stieljtes transforms are 
asymptotically equal, and the conclusion follows. □ 



We now turn to the proof of the theorem. 
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Proof of Theorem 2.3. For the weaker statement required for the 
proof of Theorem 2.3, we will show that in the 5-method we need to keep 
only the first term of the expansion as long as / has a second derivative that 
is bounded in a neighborhood of 0, and a first derivative that is bounded 
in a neighborhood of trace(E)/p. In other words, we will split the problem 
into two parts: off the diagonal, we write 

/ = /(0) + /'(0)^ + ^ 2 if j; 

V p J p 2 V p ) 

on the diagonal, we write 

'X[Xi trace(S) 



• Control of the off-diagonal error matrix. Here we focus on the matrix W 
with entry 



n^fxixj^ 
p 



The strategy is going to be to control the Frobenius norm of the matrix 

U ifi = j. 

According to Lemma 2.1, it is enough for our needs to show that the Frobe- 
nius norm of this matrix is 0(1/71) a.s. to have the result we wish. Hence, 
the result will be shown, if we can for instance show that 

max Wi j < jT (1 / 2+e) (log(p)) 1+<5 a.s., for some 5 > 0. 
Now Lemma A. 4 or Fact A.l gives, for instance, 

<(pc 2/b (p)y 1/2 [^g{p)] 2/b a.s. 



max 



p 

Therefore, with our assumption on c(p), we have 

maxWij <p- (1/2+£) (log(p)) 4/b a.s. 

Now, ||W||ir < nmaxj j|Wjj|, so we conclude that in this situation, with our 
assumptions that nxp, 

|| W\\f = o(y/n) a.s. 

Now let us focus on 

Wij = f'%j)Wij, 



N. EL KAROUI 



where £jj is between and X[Xj/p. We just saw that with very high- 
probability, this latter quantity was less (in absolute value) thanp~( 1 / 4+£ / 2 ) x 
(log(p)) 2 / 6 , if c > p~( 1 / 2 ~ £ ) 6 / 2 . Therefore if /" is bounded by K in a neigh- 
borhood of 0, we have, with very high probability that 



\\W\\ F <K\\W\\ F = o{^n). 

• Control of the diagonal matrix. We first note that when we replace the 
off-diagonal matrix by /(O)ll' + f'(0)XX'/p, we add to the diagonal certain 
terms that we need to subtract eventually. 

Hence, our strategy here is to show that we can approximate (in operator 
norm) the diagonal matrix D with entries 

/trace(E)\ + f , (XVU _ _ m &i _ /(0 ) 

V p J Vp p J p 

by Vpldp. To do so, we just have to show that the diagonal error matrix Z, 
with entries 

' X[Xi trace(E)' 



(/'(£*) - /'(o)) 



p p 



goes to zero in operator norm. 

As seen in Lemma A. 4 or Fact A.l, if c > p~0-/ 2 ~ £ )W 2 ^ with very high- 
probability, 

X[Xi trace(E) 



max 

i 



< p -(l/4+ £ /2) (log(p)) 2/6_ 



p p 

If f is continuous and hence bounded around trace ( s ) we therefore see that 
the operator (or spectral) norm of Z satisfies with high-probability 

|||^|||2<^ (1/4+£/2) 0og(p)) 2/6 . 

• Final step. We clearly have 

M - M = W + Z. 

It is also clear that M has a limiting spectral distribution, satisfying, up 
to centering and scaling, the Marcenko-Pastur equation; this was shown in 
[19]. By Lemma 2.1, we see that M and M — Z have the same limiting spec- 
tral distribution, since their difference is Z and |||-Z|||2 —> 0. Using the same 
lemma, we see that M and M — Z have (in probability) the same limiting 
spectral distribution, since their difference is W and we have established 
that the Frobenius norm of this matrix is (in probability) o(y/n). Hence, M 
and M have (in probability) the same limiting spectral distribution. □ 



We finally treat the case of kernel matrices computed from Euclidean 
norms, in this more general distributional setting. 
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Theorem 2.4. Let us call r = 2trace(S)/p where E is the covariance 
matrix of the Xi 's. Suppose that f is a real valued function which is C 2 
around r and C 1 around 0. 

Under the assumptions of Theorem 2.3, the kernel matrix M with (i,j) 
entry 



M. 



f 



\Xi — -Xjin 
p 



has a nonrandom limiting spectral distribution which is the same as that of 
the matrix 

M = f(r)ll'-2f'(r)— + v p Id n , 



P 



where v p = f(0) + Tf'(T)-f(r). 



We note once again that the term /(r)ll' does not affect the limiting 
spectral distribution of M. But we keep it for the same reasons as before. 



Proof of Theorem 2.4. Note that the diagonal term is simply /(0) Id, 
so this term does not create any problem. 

The rest of proof is similar to that of Theorem 2.3. In particular the 
control of the Frobenius norm of the second order term is done in the same 
way, by controlling the maximum of the off-diagonal term, using Corollary 
A. 3 and Fact A.l (and hence Lemma A. 4). 

Therefore, we only need to understand the first order term, in other words, 
the matrix with on the diagonal and off-diagonal entry 



R 



\Xi — Xj\\?, ^_ 
P 

\\XiW2 trace(S) 



P 



P 



+ 



\XjW2 trace(S) 



V 



P 



p 



As in the proof of Theorem 2.2, let us call tjj the vector with ith entry 
^. = M_^E1. Clearly, 



Rij = 5 id ( 10' + - 2 



XX' 



Simple computations show that 
trace(S) 



R-2- 



P 



Id = 1/ + 01' - 2 



XX' 
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Now, obviously, lip' + tpl' is a matrix of rank at most 2. Hence, R has the 
same limiting spectral distribution as 

p p 

since finite rank perturbations do not affect limiting spectral distributions 
(see, for instance, [3], Lemma 2.2). This completes the proof. □ 

2.5. Some consequences of the theorems. In practice, it is often the case 
that slight variant of kernel random matrices are used. In particular, it is 
customary to center the matrices, that is, to transform M so that its row 
sum, or column sum or both are 0. Note that these operations correspond 
to right and/or left multiplication by the matrix H = Id n — ll'/n. 

In these situations, our results still apply; the following fact makes it clear. 

Fact 2.1 (Centered kernel random matrices). Let H be the nxn matrix 
Id n -ll'/n. 

1. If the kernel random matrix M can be approximated consistently in op- 
erator norm by K , then, if a,b€ {0, 1}, 

H a MH b can be approximated consistently in operator norm by H a KH b . 

2. If the kernel random matrix M has the same limiting spectral distribution 
as the matrix K , then, if a,b€ {0, 1} ; 

H a MH b has the same limiting spectral distribution as K. 

A nice consequence of the first point is that the recent hard work on 
localizing the largest eigenvalues of sample covariance matrices (see [8, 17] 
and [31]) can be transferred to kernel random matrices and used to give 
some information about the localization of the largest eigenvalues of HMH, 
for instance. In the case of the results of [17], Fact 2 and the arguments of 
[19], Section 2.3.4, show that it gives exact localization information. In other 
words, we can characterize the a.s. limit of the largest eigenvalue of HMH 
(or HM or MH) fairly explicitly, provided Fact 2 in [17] applies. Finally, 
let us mention the obvious fact that since two square matrices A and B, 
AB and BA have the same eigenvalues, we see that HMH has the same 
eigenvalues as MH and HM because 

Proof of Fact 2.1. The proofs are simple. First note that H is posi- 
tive semi-definite and |||-ff|||2 = 1- Using the submultiplicativity of ||| • |||2, we 
see that 

\\\H a MH b - H a KH b \\\ 2 < \\\M - K||| 2 |||iJ a ||| 2 |||ii" b |||2 = \\\M - K\h. 
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This shows the first point of the fact. 

The second point follows from the fact that H a MH b is a finite rank 
perturbation of M. Hence, using Lemma 2.2 in [3], we see that these two 
matrices have the same limiting spectral distribution, and since, by assump- 
tion, K has the same limiting spectral distribution as M, we have the result 
of the second point. □ 

On Laplacian-like matrices. Finally, we point out a simple consequence 
of our results for Laplacian-like matrices. In light of recent results on man- 
ifold learning (see [9]) where these matrices play a key role, it is natural 
to ask what happens to them in our context. Suppose M is an n x n ker- 
nel random matrix as defined in the previous theorems, and consider the 
(Laplacian-like) matrix L defined by 



Call Dl the diagonal matrix made up of the diagonal elements of L. We 
note that our concentration results (Lemmas A. 3 and A. 4, as well as Fact 
A.l) imply that Dl can be approximated in operator norm by /(0)ld n in 
the scalar product kernel matrix case and by /(2trace(S p )/p) Id n in the 
Euclidean distance kernel matrix case. [It is so because Lj j is an average of 
almost constant (and equal) quantities, so with high-probability Lij cannot 
deviate from this constant value, for that would require that at least one of 
the components of the average deviate from the constant value in question.] 
Hence, there exists 7 P such that \\\Dl — 7pld n |||2 tends to almost surely. 
We also recall that the diagonal of the matrix M can be consistently ap- 
proximated in operator norm by a (finite) multiple of the identity matrix, 
so the diagonal of M/n can be consistently approximated in operator norm 
by 0. Therefore, \\\L + M/n — -D_l|||2 tends to almost surely, and therefore, 
|||L + M/n — 7 P Id n |||2 tends to zero almost surely. In other words, L can be 
consistently approximated in operator norm by 7 p Id n , —M/n. Consequently, 
when we can, as in Theorems 2.1 and 2.2, consistently approximate M in 
operator norm by a linearized version, K, of M, then L can be consistently 
approximated in operator norm by 7 p Id n —K/n, and we can deduce spectral 
properties of L from that of 7 p Id n — K/n. When we know only about the 
limiting spectral distribution of M, as in Theorems 2.3 and 2.4, the operator 
norm consistent approximation of L by 7 p Id n —M/n carries over to give us 
information about the limiting spectral distribution of L since the effect of 
7 P Id n is just to "shift" the eigenvalues by j p . We note that getting informa- 
tion about the eigenvectors of L would require finer work on the properties 
of the matrix Dl since approximating it by a multiple of the identity does 
not give us any information about its eigenvectors. 




38 



N. EL KAROUI 



3. Conclusions. The main result of this paper is that under various tech- 
nical assumptions, in high-dimensions, kernel random matrices [i.e., n x n 
matrices with (i,j)th entry f{X[Xj/p) or f(\\Xi — XjW^/p) where {Xi}f =1 
are i.i.d. random vectors in MP with p — > oo and pxn] which are often used 
to create nonlinear versions of standard statistical methods and essentially 
behave like covariance matrices, that is, linearly, a result that is in sharp 
contrast with the low-dimensional situation where p is assumed to be fixed, 
and where it is known that, under some regularity conditions, spectral prop- 
erties of kernel random matrices mimick those of certain integral operators. 
Under ICA-like assumptions, we were able to get a "strong approximation" 
result (Theorems 2.1 and 2.2), that is, an operator norm consistency re- 
sult that carries information about individual eigenvalues and eigenvectors 
corresponding to separated eigenvalues. Under more general and less linear 
assumptions (Theorems 2.3 and 2.4), we have obtained results concerning 
the limiting spectral distribution of these matrices using a "weak approxi- 
mation" result relying on bounds on Frobenius norms. 

Beside the mathematical results obtained above, this study raises several 
statistical questions, both about the richness — or lack thereof — of models 
that are often studied in random matrix theory and about the effect of 
kernel methods in this context. 

3.1. On kernel random matrices. Our study, motivated in part by nu- 
merical experiments we read about in the interesting [44] , has shown that in 
the asymptotic setting we considered which is generally considered relevant 
for high-dimensional data analysis, the kernel random matrices considered 
here behave essentially like matrices closely connected to sample covariance 
matrices. This is in sharp contrast to the low-dimensional setting where it 
was explained heuristically in [44], and proved rigorously in [28], that the 
eigenvalues of kernel random matrices converged (under certain assump- 
tions) to those of a canonically related operator. 

This suggests that kernel methods could suffer from the same problems 
that affect linear statistical methods, such as Principal Component Anal- 
ysis, in high-dimensions. The practical significance of our result is that in 
high-dimensions, the nonlinear methods that rely on kernel matrices may be 
behaving like their linear counterparts. Our study also permits the transfer 
of some recent random matrix results concerning large-dimensional sample 
covariance matrices to kernel random matrices. We now discuss some possi- 
ble practical settings to highlight when our results are and are not relevant. 

On kernel-PC A. An important motivation for this study was to try 
to understand the properties of kernel-PCA in high-dimensions. We refer 
the reader to [36] pages 48-50 for a primer on kernel-PCA, but let us say 
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that kernel-PCA performs a spectral decomposition of a (row and column- 
centralized) kernel matrix to efficiently perform a nonlinear version of PC A; 
instead of doing standard PCA, the algorithm performs PCA in feature 
space. Our results, and in particular Theorems 2.1 and 2.2 clearly show that 
in high- dimensions, when the assumptions of the theorems are satisfied, the 
algorithm may essentially be performing a linear PCA despite appearances 
to the contrary. By contrast, in low-dimension, results such as [28] show 
that the intuition behind kernel-PCA is correct and that the algorithm then 
performs a genuinely nonlinear PCA. Being aware of the difference between 
the two settings should be helpful to practitioners in that it will inform them 
about the possible limitations of kernel-PCA as a nonlinear method. For an 
example of applications, we refer the reader to, for instance, [44] and [36], 
Chapter 10. We note also that from a slightly more "numerical analysis" 
standpoint, our results basically say that the Nystrom method (see [44] and 
references therein) to approximate eigenfunctions of integral operators can 
be unreliable in high-dimensions. 

On kernel-ICA. The setting of Theorems 2.1 and 2.2 is naturally well- 
suited for applications to ICA-like problems. Since we are dealing with vec- 
tors Xi here, we would be considering multidimensional ICA problems (see 
[2], page 33). For instance, in the formulation of [2], kernel-ICA is solved by 
solving a kernel-CCA problem, that is, a generalized eigenvalue problem with 
kernel matrices as input. We refer the reader to equations (10) and (13) in 
[2] for more details. The results of Theorems 2.1 and 2.2 are directly relevant 
here, since the matrices at stake in these kernel-CCA problems can be ap- 
proximated consistently in operator norm by linear counterparts, and hence 
the solution of the kernel-CCA problem can be consistently approximated 
by the solution of the problem obtained by linearizing the kernel matrices 
at stake, provided the smallest singular value of the linearized version of 
the kernel matrices in question stay bounded away from 0. We note that 
in practice this latter requirement can be checked using our fairly detailed 
knowledge of the properties of extreme eigenvalues of sample covariance ma- 
trices. We note that in this setting, our theorems confirm the predictions in 
[2], page 33, that problems might arise with the algorithm they propose in 
high-dimensions due to slow decay of eigenvalues. 

Geostatistics applications. Certain kernel matrices, corresponding to co- 
variance functions of, for instance, Gaussian processes, also appear in geo- 
statistics and spatial statistics in techniques such as kriging (see, e.g., [15], 
Chapter 3 and, for instance, pages 106-110). For examples of kernels that 
correspond to covariance functions of Gaussian processes, we refer the reader 
to [33], Chapter 4. Naturally, in those sort of applications, the dimension of 
the data vectors is low (at most 3), and therefore "classical results" such as 
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those of [28] apply whereas our results are limited to the high-dimensional 
setting more often encountered in some problems of multivariate statistics, 
machine learning or bioinformatics. 

3.2. Limitations of standard random matrix models. In the study of 
spectral distribution of large-dimensional sample covariance matrices, it has 
been somewhat forcefully advocated that the study should be done under 
the assumptions that the data are of the form X{ = Y}l 2 Yi where the entries 
of Yi have, for instance, finite fourth moment. At first sight, this idea is ap- 
pealing, as it seems to allow a great variety of distributions and hence flexible 
modeling. A possible drawback however, is the assumption that the data are 
linear combinations of i.i.d. random variables or the necessary presence of 
independence in the model. This has however been recently addressed (see, 
e.g., [19]) and it has been shown that one could go beyond models requiring 
independence in a lurking random vector which the data linearly depend on. 

Data analytic consequences. However, a serious limitation is still present. 
As the results of Lemmas A. 3, A. 4 and Fact A.l make clear, under the 
models for which the limiting spectral distribution of the sample covariance 
matrix has been shown to satisfy the Marcenko-Pastur equation, the norms 
of the data vectors are concentrated, and the corresponding data vectors 
are almost orthogonal to one another. In other words, under the "standard" 
ICA-like random matrix models (used in Theorems 1 and 2), that is, the 
random vectors {Aj}™ =1 are i.i.d. in MP, = E 1 / 2 !^ with Yi having i.i.d. 
entries with mean 0, variance 1 and 4 + e absolute moments, we have, as- 
suming that {li}™ =1 are i.i.d., p — >oo, p/n and |||£|||2 remain bounded; the 
vectors {Aj}™ =1 have the property that for a deterministic (and computable) 
sequence c p , we have 

max | llXjllo/p — cJ — > 

l<i<n ' 

and 

max|X-Xj|/p->-0. 

Both these statements hold almost surely. Geometrically, this means that 
the vectors {Xi/^/p}f =l are close to a sphere and almost orthogonal to one 
another. These properties also hold for the more general (and less linear) 
models we considered in Theorems 2.3 and 2.4. 

Hence, if one were to plot a histogram of {H^QIIlA'liLi) this histogram 
would look tightly concentrated around a single value — the spread of this 
histogram being computable from our concentration results (Lemmas A. 3, 
A. 4 and Fact A.l). Though the models appear to be quite rich, the geometry 
that we can perceive by sampling n such vectors, with n x p, is, arguably, 
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relatively poor. These remarks should not be taken as aiming to discredit the 
interesting body of work that has emerged out of the study of such models. 
Their aim is just to warn possible users that in data analysis, a good first 
step would be to plot the histogram of {||^i|||/p}iLi an d check whether 
it is concentrated around a single value. Similarly, one might want to plot 
the histogram of inner products {X[Xj/p} and check that it is concentrated 
around 0. If this is not the case, then insights derived from random matrix 
theoretic studies would likely not be helpful in the data analysis. 

We note, however, that recent random matrix work (see [13, 14, 19, 32]) 
has been concerned with distributions which could be loosely speaking be 
called of "elliptical" type — though they are more general than what is usu- 
ally called elliptical distributions in Statistics. In those settings, the data 
is, for instance, of the form X\ = r^S 1 / 2 !^ where T{ is a real-valued random 
variable, independent of Y{. This allows the data vectors to not approxi- 
mately live on spheres (but does not change anything about angles between 
different vectors) , and is a possible way to address some of the concerns we 
just raised. The characterization of the limiting spectrum gets quite a bit 
more involved than in the "standard" setting, that is, n = l, and the results 
show a lack of robustness to the "indirect" assumption that the data vectors 
live close to a sphere. 

Finally, this geometric discussion applies also to theoretical studies under- 
taken under the assumptions that the Xi are A/"(0, £ p ) and that the problem 
is high dimensional. It should highlight some possibly severe limitations of 
the normality assumption in high- dimensions. 

APPENDIX 

In this appendix, we collect a few useful results that are needed in the 
proof of our theorems, and whose content we thought would be more acces- 
sible if they were separated from the main proofs. 

(A) Some useful results. We have the following elementary facts. 

Lemma A.l. Suppose Y is a vector with i.i.d. entries and mean 0. Call 
its entries yi. Suppose E(y 2 ) = cr 2 and E(y 4 ) = /i4. Then if M is a deter- 
ministic matrix, 

(4) E(YY'MYY') = a 4 (M + M') + (/i 4 - 3a 4 ) diag(M) + a 4 trace(M) Id. 
Further, we have (Y'MY) 2 = trace(Af YY'MYY') and 



E(trace(Myy'Myy')) 

= a 4 trace(M 2 + MM') + cr 4 (trace(M)) 2 + (/i 4 - 3d 4 ) trace(Af o M). 
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Here diag(M) denotes the matrix consisting of the diagonal of the matrix 
M and off the diagonal. The symbol o denotes Hadamard multiplication 
between matrices. 

Proof. Let us call R = YY'MYY'. The proof of the first part is elemen- 
tary and consists merely in writing the (i, j)th entry of the corresponding 
matrix. As a matter of fact, we have 



Using the fact that entries of Y are independent and have mean 0, we see 
that, in the sum, the only terms that will not be in expectation are those 
for which each index appears at least twice. If j, only the terms of the 
form yfyj have this property. So if i ^ j, 



Let us now turn to the diagonal terms. Here again, only the terms y 2 y\ 
matter. So on the diagonal, 



We conclude that 

E(R) = o\M + M') + (/z 4 - 3a 4 ) diag(M) + trace(M) Id. 

The second part of the proof follows from the first result, after we remark 
that, if D is a diagonal and L is general matrix, trace(LD) = trace(L o D), 
from which we conclude that trace(Mdiag(M)) = trace(M o diag(M)) = 
trace(MoM). □ 

Lemma A. 2 (Concentration of quadratic forms). Suppose the vectors Z 
is a vector in W with i.i.d. entries of mean and variance a 2 . Suppose that 
their entries are bounded by B p . Let M be a symmetric matrix, with largest 
singular value o~\(M). Call 



Ri,j = viVj ^ yiVj M i,j = X] yiyjykyi M k,i- 



i,j k,l 



B(R hJ ) = B(yfy](M^ + M jA )) = a 4 (M^ + M jA ). 



E(i?i,i) = mM iti + o- 4 Y^ M i,i = (M4 - o-^M^ + trace(M). 



128exp(47r)o-i(Af)#; 



P 



Then we have, if r/2 > 

/ Z'MZ o trace 




(6) 



< 8exp(4vr)exp(-p(r/2 - ( p ) 2 / (32B 2 {1 + 2v p ) 2 a 1 {M))) 
+ 8exp(4^)exp(-p/(32 J B2(l + 2^) 2 ai(M))). 
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Proof. We can decompose, using the spectral decomposition of M, 
M = M+ — M- where M+ is positive semi-definite and M_ is positive defi- 
nite (or if M is itself positive semi-definite). We can do so by replacing the 
negative eigenvalues of M by in the spectral decomposition and get M_|_ in 
that way. Note that then, the largest singular values of M + and M_ are also 
bounded by a\{M) since o\{M) is absolute value of the largest eigenvalue 
of M in absolute value, and the nonzero eigenvalues of M + are a subset of 
the eigenvalues of M and so are the eigenvalues of M_ when M_ is not 0. 
Now it is clear that the function F which associates to a vector x in W the 
scalar \J x' M + x /p = \\M+ x/y/p\\2 is a convex, \J o\ (M) /p-Lipschitz func- 
tion with respect to Euclidean norm. Calling rap the median of F(Z), we 
have, using Corollary 4.10 in [29], 

P(\F(Z) - m F \ >r)< 4exp(-pr 2 /(16 J B|o-i(M))). 

Let us now call hf the mean of F{Z) (it exists according to Proposition 1.8 
in [29]). Following the arguments given in the proof of this Proposition 1.8, 
and spelling out the constants appearing in the last result of Proposition 1.8 
in [29], we see that 

P(\F(Z) -n F \ >r) <4exp(4vr)exp(-pr 2 /(32^ ( ri(M))). 

(Using the notation of Proposition 1.8 in [29], we picked K2 = 1/2, and 
C = exp(-7rC 2 /4); showing that this is a valid choice just requires one to carry 
out some of the computations mentioned in the proof of that Proposition.) 
Let us call A,B,D the sets 



Z'M+Z 
P 



> r 



D 



'Z'M+Z 



V 



+ Hf < 1 + 2fiF 




(J.F < 1 



and 



D 



'Z'M+Z 



P 



fJLF 



>r/(l + 2 MF ) 



Of course, we have P{A) < P(A n B) + P{B C ). Now note that An BCD, 
simply because for positive reals, a — b/ (y/a + yb) = y/a — Vb. We conclude 
that 

P(A) <4exp(47r)[exp(-pr 2 /(32B 2 (l + 2/i F ) 2 C j 1 (M))) 



+ exp(-p/(32S 2 C 7 1 (M)))]. 
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Let us know call a 2 the variance of the each of the component of Z. We 
know, according to Proposition 1.9 in [29], that 



var(i ? ) 



B(Z'M + Z) 



P 



a 



2 trace(M + ) 



P 



V-F < Cp 



128exp(4vr)(7i(M)S2 



P 



Hence, we conclude that, if r > £ p , 



Z'M + Z 2 trace(M+) 



a 



> r 



P P 
< 4exp(4^) exp(-p(r - C P ) 2 /(32£ 2 (1 + 2 l x F fa 1 {M))) 

+ 4exp(47r)exp(-p/(32^(l + 2 / u F ) 2 a 1 (M))). 

To get the announced result, we note that for the sum of two reals to be 
greater than r in absolute value, one needs to be greater than r/2, and that 
our bounds become conservative when we replace \i f (and its counterpart for 
M_) by is p . [Note that we get conservative bounds when replacing the /2_f's 
by max(E( A /Z'M + Z/p),E( 1 /Z'M_Z/p)), and that this quantity is clearly 
bounded by <7<7i(X).] Hence, we have, as announced, if r/2 > £ p , 



P 



Z'MZ 2 trace(M) 



a 



P 



P 



> r 



< 8exp(4vr) exp(-p(r/2 - ( p ) 2 / {32B 2 (1 + 2/x F )Vi(M))) 

+ 8exp(47r)exp(-p/(32^(l + 2 A i F ) 2 ai(M))). 

Finally, we note that the proof makes clear that the same result would hold 
for different choices of M+ and M_, as long as max(c7i(M+),cJi(M_)) < 
ai(M). □ 

We therefore have the following useful corollary: 



Let Yi and Yj be i.i.d. random vectors as in Lemma 



Corollary A.l. 

A. 2 with variance 1. Suppose that E is a positive semi-definite matrix. We 
have, with 

128exp(4vr)cJi(S) J B2 



P 



and 
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that if r/2 > £ p and K = 8exp(47r), 

> A < Kexp(-p(r/2 - ( P ) 2 /(32B 2 (1 + 2v p fa 1 {T))) 

+ Kexp(-p/(32^(l + 2v p f(j 1 {T))). 

Proof. The proof relies on the results of Lemma A. 2. Remark that, 
since £ is symmetric, 

Now the entries of the vector made by concatenating Yi and Yj are i.i.d. 
and so we fall back into the setting of Lemma A. 2. Finally, here M + and 

M_ are known explicitly. A possible choice is M + = 1/2 and M_ = 

1/2 ^ q . Up is obtained by upper bounding the expectation of the square 

of F in the notation of the proof of the previous lemma for these explicit 
matrices. Note that their largest singular values are both smaller that <ti(£), 
so the results of the previous lemma apply. □ 



P 



P 



Lemma A. 3. Let {Yj}™ =1 be i.i.d. random vectors in W p , whose entries 
are i.i.d., mean 0, variance 1 and have bounded (in p) m > 4 absolute mo- 
ments. Suppose that is a sequence of positive semi-definite matrices 
whose operator norms are uniformly bounded in p and n/p is asymptotically 
bounded. We have, for any given e > 0, 



max 

i,3 



Y-HpYj trace(Sp) 

p P 



< p -l/2 + 2/ m(log(p)) (l +e) /2 



Proof. Throughout the proof, we assume without loss of generality 
that m < oo. 

Call t = 2/m. It is clear that with our moment assumptions, t < 1/2. 
According to Lemma 2.2 in [45], the maximum of the array of {1^,}™ =1 is a.s. 
less than p l . So to control the maximum of the inner products of interest, 
it is enough to control the same^quantity when we replace Y% by Yi with 
Yi.i = li 5 /l|y. A<pt. Now note that Yi satisfies the boundedness assumption of 
Corollary A.l, but its mean is not necessarily zero and its variance is not 1. 
Note however, that all the entries of Yi have the same mean, Ji. Since Yi has 
mean 0, we have 

|/2| < E(|Y 1 , 1 |l| yia | >pt ) <E(|Y 1 , 1 |>- t ( m - 1 )) < W - 2+ *. 

Similarly, if we call a 2 the variance of Y, we have 

a 2 = V(\Y 1}1 \ 2 l ]YlM < pt ) -Jl 2 = 1 - (E(|Y 14 | 2 l| nil | >f)t ) + Jl 2 ). 
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Hence, < 1 — a 2 , and 

i-5 2 = E(|y lil | 2 i| yM | >pt ) + ^ 2 

<E(|Fi ) i| m p-*( m - 2 )) + /I 2 

Let us call Ui = Yi — Jll p and Ui = Ufa where a 2 is also the variance of 
Ui. Corollary A.l applies to the random variables Ui with B p = 2p t when p 
is large enough. So Q p = 0(p 1 ~ 2t ). Let us now call, for some e > 0, 

r{p)=p t - 1 / 2 {\og{p))^l 2 . 

Since, for p large enough, r(p)/2 > Q p , we can apply the conclusions of 
Corollary A.l, and by plugging in the different quantities, we see that 



PdU&pUjM > r(p)) < exp(-K(log(p)) 



where K denotes a generic constant (that may change from display to dis- 
play). In particular, K is independent of p and is hence trivially bounded 
away from as p grows. The bound we just obtained on 1 — a 2 also implies 
that for p large enough, a 2 > 1/2 from which we conclude that for another 
K with the same properties, 

P{\U' j Y, p U j /p\ >r{p)) <exp(-A-(log(p)) 1+e ). 

In other respects, the arguments of Lemma A. 2 show that, since a 2 is the 
variance of Ui, 

P(\U£ p Ui/p-a 2 tT a ce(Z p )/p\>r(p))<eM-K(log(p)) 1+£ ). 

Now 

Y!X p Y j _ U£ p U j (l'X p Uj + ^S P 1) . ~ 2 1%1 

p p p p 



Remark that l'X p l <pcri(£ p ), and |l'E p E/j-| < ^/l'£ p l ^jU'^ p Uy We con- 
clude, using the results obtained in the proof of Lemma A. 2 that with proba- 
bility greater than 1 — exp(— i^(log(p)) 1+e ), the middle term is smaller than 



2\/cri(Sp)(^/cri(Sp) + r(p))|/I|. As a matter of fact, ^JU'-T lp Uj jp is concen- 
trated around its mean which is smaller than ay / trace(S p )/p which is itself 
smaller than \J o\ (S p ) . Now recall that |ju| = 0(p~ 2+t ) = o(r(p)). We can 



therefore conclude that 



P 



Y/T^pYj _ 2 trace(S p 

p p 



> 2r(p) J < 2exp(-i^(log(p)) 1+e ). 
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Now note that < 1 - a 1 = 0(p 



o(r(p)) since t < 1/2 < 3/2. With 



our assumptions, trace(£ p )/p remains bounded, so we have finally 

trace fS 



P 



P 



P 



> 3r(p) < 2exp(-A"(log(p)) 



1+^ 



And therefore, 



P I max 



trace(S ? 



> 3r(p) < 2n 2 exp(-K(log(p)) 



l+e\ 



p p 

Using the Borel-Cantelli lemma, we reach the conclusion that 

trace 



max 

hi 



P 



P 



< 3r(p) = 3p 2/m ~ 1/2 log(p) 



a.s. 



Because the left-hand side is a.s. equal to | Y% S p pYj — 5ij traci ^ s p) 



, we reach the 

announced conclusion but with r(p) replaced by 3r(p). Note that, of course, 
any multiple of r(p), where the constant is independent of p, would work in 
the proof. In particular, by taking r(p) = r(p)/3, we reach the announced 
conclusion. □ 



Corollary A. 2. Under the same assumptions as that of Lemma A. 3, 
y 2 Yi, we also have 

trace 



if we call Xi = S 



max 



x j\\l 



P 



P 



< 



P 



-l/2+2/m 



(log(p)) 



(l+e)/2 



a.s. 



Proof. The proof follows immediately from the results of Lemma A. 3, 
after we write 

|| Xi — -XjUl — 2trace(S p ) 

= [YiT.pYi - trace(Sp)] + [YjUpYj - trace(Ep)] - 2Y i 'T lp Y j . 

Note that as explained in the proof of Lemma A. 3, the constants in front of 
the bounding sequence do not matter, so we can replace 3p~ 1 / 2+2 / m (log(p))^ 1+e ^ 2 
by p _1 / 2+2 / m (log(p))^ 1+£ - ) / 2 , and the result still holds. [In other words, we 
are really using Lemma A. 3 with upper bound p _1 / 2+2 / m (log(p))^ 1+£ ^ 2 /3.] 
□ 

Lemma A. 4. Let {Xi}f =1 be i.i.d. random vectors in M. p whose entries 
are i.i.d., mean 0, having the property that for 1-Lipschitz (with respect to 
Euclidean norm) functions F, if we denote by mp the median of F(Xi), 



P(\F(Xi) - m F \ > r) < C7exp(-c(p)r 2 
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where C is independent of p and c is allowed to vary with p (if it goes to 
zero, we assume it does so like p~ a , < a < 1). Call T, p the covariance 
matrix of X±. Assume that o"i(S p ) remains bounded in p. Then, under the 
triangular array construction of Theorem 2.3, we have, for any e > 0, 



max 



trace(£ ? 



P 



< (pc(p))- 1/2 (log(p)) (1+e)/2 a.s. 



Proof. The proof once again relies on concentration inequalities. First 
note that Proposition 1.11 combined with Proposition 1.7 in [29] shows that 
if Xi and Xj are independent and satisfy concentration inequalities with 
concentration function a(r) (with respect to Euclidean norm), then the vec- 
tor (y ) also satisfies concentration inequalities with concentration function 

2a(r/2) with respect to Euclidean norm in M 2p . (We note that Proposition 
1.11 is proved for the metric on M 2p || • ||2 + || • H2 where each Euclidean 
norm is a norm in MP, but the same proof goes through for Euclidean norm 
on M 2p . Another argument would be to say that the metric || • H2 + || • lb is 
equivalent to the norm of the full M 2p with the constants in the inequalities 
being 1 and y/2 simply because for a, b > 0, V« 2 + b 2 <a + b< V2Va 2 + b 2 .) 

Therefore, the arguments of Lemma A. 2 go through without any problems 
with S p = Id and B 2 = 4/c(p). So a result similar to Corollary A.l holds 
and we can apply the same ideas as in the proof of Lemma A. 3 and get the 
announced result. □ 

Corollary A. 3. Under the assumptions of Lemma A. 4, we have, for 
any e > 0, 

\ x i- x j\\l trace(£ p ) 



max 



p p 



< {pc(p))~ 1/2 (\°E(p)) (1+£)/2 o.s. 



Proof. The proof is an immediate consequence of Lemma A. 4, along 
the same lines as the proof of Corollary A. 2. □ 



Finally, allow the same lines of proof; we have the following fact. 



Fact A.l. Let {Xj}™ =1 be i.i.d. random vectors in MP whose entries 
are i.i.d., mean 0, having the property that for 1-Lipschitz (with respect to 
Euclidean norm) functions F, if we denote by mp the median of F(Xi), 

P(\F(Xi) - m F \ >t)< Cexp(-c(p)i fc ) for some b > 0, 

where C is independent of p and c is allowed to vary with p (if it goes to 
zero, we assume it does so like p~ a , < a < b/2). Call S p the covariance 
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matrix of X\. Assume that <ti(£ p ) remains bounded in p. Then, we have, 
under the triangular array construction of Theorem 2.3, for any e > 0, 

< {pc 2/b (p))- 1/2 (log(p)) (1+£)/b a.s. 

Also, we then have 

< Cpc 2/6 (p)r 1/2 (log(p))( 1+£ )/ 6 a.s. 

The proof of this last fact follows the same step as that of Lemma A. 4, 
with a slight adjustment since we need to replace 2 by b. For a related 
question and more details, we refer the reader to [19]. 

(B) A linear algebraic result. Finally, we finish this appendix with a 
linear algebraic lemma which we need in our approximations and is of inde- 
pendent interest. 

Lemma A. 5. Suppose M is real symmetric matrix with nonnegative en- 
tries. Suppose that E is a real symmetric matrix such that maxij\Eij\ < 
for some ( > 0. Then, if a\{A) is the largest singular value of matrix A and 
if o represents the Hadamard product (i.e., entrywise multiplication of two 
matrices), we have 

ai(EoM) <(ax(M). 

Proof. We first note that EoM is real symmetric. Therefore, 
a^EoM) = lim [trace((£ o M) 2k )] 1/{2k) . 

k— >oo 

Now we claim that 

|trace((£o M) 2k )\ < ( 2k trace(M 2fc ). 
To see this, recall that for a p x p matrix A, 

trace(^4 )= ^ ^ -^11,12^2,13 ' " ' -^i^ii- 

l<h,i2,—,ik<P 

Now, 

\ A. .A A. . I < I A. . II A. . I ... 14. . I 

When A = E o M, Aij = EijMij. Since My > 0, we therefore have \Eij x 
Mi j I < C M i,j- Hence, 

|trace((SoM) fc )| < C^M^M^ ■ ■ ■ M ikM = ( k trace(M fc ). 

l<ii,i 2 ,... ,ik<P 



max 



XjXj trace(Sp) 
p p 



x i- x j\\2 trace(S p ) 



max = ^ - 2 
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So 

[trace((£ o M) 2k )] 1/{2k) < C[trace(M 2fe )] 1 ^ 2fc ). 
Taking limits as k — > oo concludes the proof. □ 
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